# NumPy

**Numerical Python** (NumPy) is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays.

The First Rule of NumPy

In [1]:
import numpy as np

The Second Rule of NumPy

In [3]:
# You don't need explicit loops

first_arr = np.array([1, 2, 3, 4, 5])
second_arr = np.copy(first_arr)

# Instead of
for i in range(len(first_arr)):
    if first_arr[i] == 3 or first_arr[i] == 4:
        first_arr[i] = 0
# Do
second_arr[(second_arr == 4) | (second_arr == 3)] = 0

assert((first_arr == second_arr).all())
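The same loop-free replacement can be written in a couple of other ways. The sketch below is self-contained and uses np.isin to build the mask and np.where to produce a new array; the variable names are illustrative only.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# np.isin builds the boolean mask "value is one of these" in one call
mask = np.isin(arr, [3, 4])

# In-place assignment through the mask (on a copy, to keep arr intact)
zeroed = arr.copy()
zeroed[mask] = 0

# np.where returns a new array instead of modifying one in place
zeroed_where = np.where(mask, 0, arr)

print(zeroed)        # [1 2 0 0 5]
print(zeroed_where)  # [1 2 0 0 5]
```

np.isin scales better than chaining many == comparisons with | when the list of values grows.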

Array creation

In [4]:
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 8])
print(type(a))
print(a.shape)
print(a.dtype)

<class 'numpy.ndarray'>
(9,)
int32


Assigning and appending an element to an array

In [5]:
a[8] = 9
a = np.append(a, 10)
print(a)

[ 1  2  3  4  5  6  7  8  9 10]


Standard Python slicing syntax

In [6]:
print(a)
print('\n')

print(a[0:5])
print(a[0:5:2])
print(a[0:-1])
print(a[4::-1])
print(a[5:0:-2])

[ 1  2  3  4  5  6  7  8  9 10]


[1 2 3 4 5]
[1 3 5]
[1 2 3 4 5 6 7 8 9]
[5 4 3 2 1]
[6 4 2]
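One difference from Python lists is worth a quick illustration: a basic NumPy slice is a view onto the original data, not a copy. A minimal sketch with made-up variable names:

```python
import numpy as np

base = np.arange(1, 11)        # [ 1  2 ... 10]
view = base[0:5]               # basic slice: shares memory with base
view[0] = 100                  # writes through to base

independent = base[5:].copy()  # explicit copy
independent[0] = -1            # base is not affected

print(base[0])                       # 100
print(base[5])                       # 6
print(np.shares_memory(base, view))  # True
```

Call .copy() on a slice whenever the original array must stay unchanged.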


Built-in calculation of common statistics

In [7]:
print('Vector max %d, min %d, mean %.2f, median %.2f, standard deviation %.2f and total sum %d' %
      (a.max(), np.min(a), a.mean(), np.median(a), a.std(), a.sum()))

Vector max 10, min 1, mean 5.50, median 5.50, standard deviation 2.87 and total sum 55


Filtering on condition (masking)

In [8]:
a[a > a.mean()]

array([ 6,  7,  8,  9, 10])

Sorting

In [9]:
# Sorted array
print(np.sort(a))

# Indices that would sort the array
print(np.argsort(a))

[ 1  2  3  4  5  6  7  8  9 10]
[0 1 2 3 4 5 6 7 8 9]
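Because the array above is already sorted, argsort simply returns 0 to 9. A small sketch on unsorted data (the arrays here are invented for illustration) shows what the indices are good for, such as reordering a second array in step with the first:

```python
import numpy as np

scores = np.array([30, 10, 20])
labels = np.array(['c', 'a', 'b'])

order = np.argsort(scores)   # positions that would sort scores ascending

print(order)                 # [1 2 0]
print(scores[order])         # [10 20 30]
print(labels[order])         # ['a' 'b' 'c']
print(scores[order[::-1]])   # [30 20 10]  (descending)
```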


Vector operations

In [10]:
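a = np.array([1, 2, 3])
b = np.array([2, 3, 4])
a * b  # element-wise product
a - b  # element-wise difference
a + b  # only this last expression is displayed by the notebook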

array([3, 5, 7])

2D arrays (matrices)

In [11]:
m_a = np.array([[1, 2, 3, 4]
                ,[13, 3, 8, 2]
                ,[8, 7, 2, 3]])
print(m_a.shape)

(3, 4)


Statistics calculation

In [12]:
print(m_a.max())
print(m_a.max(axis=0))
print(m_a.max(axis=1))

13
[13  7  8  4]
[ 4 13  8]
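The axis argument names the dimension that gets collapsed: axis=0 reduces over rows (one result per column), axis=1 reduces over columns (one result per row). A short sketch reusing the same matrix values; keepdims and argmax are shown as optional extras:

```python
import numpy as np

m = np.array([[1, 2, 3, 4],
              [13, 3, 8, 2],
              [8, 7, 2, 3]])

col_max = m.max(axis=0)                    # shape (4,): one value per column
row_max = m.max(axis=1)                    # shape (3,): one value per row
row_max_2d = m.max(axis=1, keepdims=True)  # shape (3, 1): keeps the reduced axis
row_argmax = m.argmax(axis=1)              # column position of each row's maximum

print(col_max)     # [13  7  8  4]
print(row_max)     # [ 4 13  8]
print(row_argmax)  # [3 0 0]
```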


Sorting

In [13]:
print(np.sort(m_a, axis=0))
print(np.sort(m_a, axis=1))

[[ 1  2  2  2]
 [ 8  3  3  3]
 [13  7  8  4]]
[[ 1  2  3  4]
 [ 2  3  8 13]
 [ 2  3  7  8]]


# Intro to Pandas data structures

In [15]:
import pandas as pd

## Series

[Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

In [16]:
s = pd.Series(data=[39.4, 91.2, 80.5, 20.3, 4.2, -13.4]
              ,index=['first', 'second', 'second', 'third', 'fourth', 'fifth'])
print(type(s))
print(s.shape)
print(s.dtype)
print(s['second'])

<class 'pandas.core.series.Series'>
(6,)
float64
second    91.2
second    80.5
dtype: float64


In [17]:
s = pd.Series(data=[39.4, 91.2, 20.3, 4.2, -13.4])
s

0    39.4
1    91.2
2    20.3
3     4.2
4   -13.4
dtype: float64

Series acts very similarly to an [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html), and is a valid argument to most [NumPy](https://numpy.org/doc/stable/user/whatisnumpy.html) functions. However, operations such as slicing will also slice the index.

In [18]:
s[1:4]

1    91.2
2    20.3
3     4.2
dtype: float64
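With a non-default index, it helps to be explicit about whether a slice is positional or label-based. A minimal sketch, assuming a small Series with string labels invented for this example:

```python
import pandas as pd

s = pd.Series([39.4, 91.2, 20.3, 4.2], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[1:3]   # positions 1 and 2; the end position is excluded
by_label = s.loc['b':'c']   # labels 'b' through 'c'; the end label is included

print(by_position.index.tolist())  # ['b', 'c']
print(by_label.index.tolist())     # ['b', 'c']
print(s.loc['b':'d'].tolist())     # [91.2, 20.3, 4.2]
```

Both slices carry their index labels along, which is the behaviour described above.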

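Built-in statistics work much as in NumPy. One difference: Series.std() computes the sample standard deviation (ddof=1) by default, whereas NumPy's std uses ddof=0.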

In [19]:
np.max(s)
s.min()
s.std()

40.19736309759634
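The value printed above is the sample standard deviation, which is what pandas computes by default. The sketch below compares the pandas and NumPy defaults on the same numbers and shows how to make them agree through the ddof parameter:

```python
import numpy as np
import pandas as pd

values = [39.4, 91.2, 20.3, 4.2, -13.4]
s = pd.Series(values)
arr = np.array(values)

sample_std = s.std()        # pandas default: ddof=1 (divide by n - 1)
population_std = arr.std()  # NumPy default: ddof=0 (divide by n)

print(round(sample_std, 2))                     # 40.2
print(np.isclose(s.std(ddof=0), arr.std()))     # True
print(np.isclose(arr.std(ddof=1), s.std()))     # True
```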

Vector operations

In [20]:
a = pd.Series([1, 2, 3])
b = pd.Series([2, 3, 4])

a * 2
a + b
a - b
a * b

0     2
1     6
2    12
dtype: int64
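Unlike NumPy arrays, Series are aligned on their index labels before an operation is applied, so the position of an element does not matter. A small sketch with made-up labels; labels present in only one operand produce NaN unless a fill value is given:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

total = a + b                    # matched by label; 'x' and 'w' become NaN
filled = a.add(b, fill_value=0)  # treat a missing operand as 0 instead

print(total['y'], total['z'])    # 12.0 23.0
print(int(total.isna().sum()))   # 2
print(filled['x'], filled['w'])  # 1.0 30.0
```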

## DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

DataFrame creation

In [21]:
d = {"one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
     "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
   }
df = pd.DataFrame(d)
df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [22]:
d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
df = pd.DataFrame(d)
df

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0


### Basic transformations

Load an existing DataFrame from a CSV file. We'll use the Google Play Store Apps dataset, available at https://www.kaggle.com/lava18/google-play-store-apps.

In [23]:
data = pd.read_csv('googleplaystore.csv')
data.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


Column types

In [24]:
data.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

Column selection as DataFrame

In [25]:
data[['App', 'Rating']].head()

                                                 App  Rating
0     Photo Editor & Candy Camera & Grid & ScrapBook     4.1
1                                Coloring book moana     3.9
2  U Launcher Lite – FREE Live Cool Themes, Hide ...     4.7
3                              Sketch - Draw & Paint     4.5
4              Pixel Draw - Number Art Coloring Book     4.3


Column selection as Series. You can treat a DataFrame semantically like a dict of like-indexed Series objects.

In [26]:
data['Rating'].head()

0    4.1
1    3.9
2    4.7
3    4.5
4    4.3
Name: Rating, dtype: float64

In [27]:
type(data['Rating'])

pandas.core.series.Series

In [28]:
data['Rating'].mean()

4.193338315362448

Row selection by integer position

In [29]:
data.iloc[1]

App                     Coloring book moana
Category                     ART_AND_DESIGN
Rating                                  3.9
Reviews                                 967
Size                                    14M
Installs                           500,000+
Type                                   Free
Price                                     0
Content Rating                     Everyone
Genres            Art & Design;Pretend Play
Last Updated               January 15, 2018
Current Ver                           2.0.0
Android Ver                    4.0.3 and up
Name: 1, dtype: object
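In the dataset above the index labels happen to equal the row positions, so iloc and loc look interchangeable. The toy frame below (hypothetical app names and a deliberately non-default index) separates the two:

```python
import pandas as pd

df = pd.DataFrame(
    {'App': ['alpha', 'beta', 'gamma'], 'Rating': [4.1, 3.9, 4.7]},
    index=[10, 20, 30],
)

second_row = df.iloc[1]      # by position: the second row
row_20 = df.loc[20]          # by label: the row whose index label is 20
cell = df.loc[30, 'Rating']  # a single value by row label and column name

print(second_row['App'])     # beta
print(row_20['App'])         # beta
print(cell)                  # 4.7
```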

Filtering (row selection by condition)

In [30]:
data[data['Rating'] < 3].head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
477,Calculator,DATING,2.6,57,6.2M,"1,000+",Paid,$6.99,Everyone,Dating,"October 25, 2017",1.1.6,4.0 and up
518,Just She - Top Lesbian Dating,DATING,1.9,953,19M,"100,000+",Free,0,Mature 17+,Dating,"July 18, 2018",6.3.7,5.0 and up
520,EliteSingles – Dating for Single Professionals,DATING,2.5,5377,19M,"500,000+",Free,0,Mature 17+,Dating,"July 31, 2018",4.8.5,4.0.3 and up
527,Sugar Daddy Dating App,DATING,2.5,277,5.7M,"100,000+",Free,0,Mature 17+,Dating,"December 4, 2017",3.0.0,4.1 and up
528,Adult Dirty Emojis,DATING,2.8,80,5.5M,"10,000+",Free,0,Teen,Dating,"November 6, 2017",1.0,4.0.3 and up


Assigning value to column based on condition

In [None]:
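# Not correct: chained indexing assigns to a temporary copy,
# so data itself is not reliably updated (pandas warns about this)
data[data['Rating'] < 3]['Rating'] = 0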

In [None]:
data[data['Rating'] < 3].head(3)

In [None]:
# Correct: select the rows and the column in a single .loc call
data.loc[data['Rating'] < 3, 'Rating'] = 0

In [None]:
data[data['Rating'] < 3].head(3)
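The effect of a single .loc assignment is easy to check on a toy frame. This sketch uses invented data and deliberately avoids running the chained form, which pandas usually flags with a warning:

```python
import pandas as pd

df = pd.DataFrame({'App': ['a', 'b', 'c'], 'Rating': [2.5, 4.0, 1.0]})

low = df['Rating'] < 3       # boolean mask, computed once

# One indexing operation: row mask and column name together
df.loc[low, 'Rating'] = 0

print(df['Rating'].tolist())     # [0.0, 4.0, 0.0]
print(len(df[df['Rating'] < 3])) # 2  (the zeroed rows still match the filter)
```

Note that rows set to 0 still satisfy Rating < 3, so re-running the filter returns them with their new value rather than an empty result.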

Sorting

In [None]:
data.sort_values(by='Rating').head(3)

Selecting values

In [None]:
data.head()['App'].values.tolist()

**All together**. List the top 5 free apps in the Education genre by rating, breaking ties alphabetically by app name.

In [None]:
(data[(data['Type'] == 'Free') & (data['Genres'] == 'Education')]
    .sort_values(by=['Rating', 'App'], ascending=[False, True])
    .head(5)['App'].tolist())
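To see the tie-breaking at work without the full dataset, here is the same filter, sort and select chain on a tiny hypothetical frame (app names and ratings are invented):

```python
import pandas as pd

apps = pd.DataFrame({
    'App': ['Delta', 'Alpha', 'Charlie', 'Bravo', 'Echo'],
    'Type': ['Free', 'Free', 'Paid', 'Free', 'Free'],
    'Genres': ['Education', 'Education', 'Education', 'Education', 'Tools'],
    'Rating': [4.5, 4.8, 4.9, 4.8, 5.0],
})

top = (
    apps[(apps['Type'] == 'Free') & (apps['Genres'] == 'Education')]
    # Rating descending first, then App ascending to break ties
    .sort_values(by=['Rating', 'App'], ascending=[False, True])
    .head(2)['App']
    .tolist()
)

print(top)  # ['Alpha', 'Bravo']
```

Charlie is excluded because it is paid and Echo because of its genre; Alpha and Bravo tie on rating and are ordered by name.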

### Concatenating

Appending DataFrames

In [None]:
df1 = pd.DataFrame({
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"],},
    index=[0, 1, 2, 3],)

df2 = pd.DataFrame({
    "A": ["A4", "A5", "A6", "A7"],
    "B": ["B4", "B5", "B6", "B7"],
    "C": ["C4", "C5", "C6", "C7"],
    "E": ["E4", "E5", "E6", "E7"],},
    index=[0, 1, 2, 3],)

In [None]:
df1

In [None]:
df2

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
pd.concat([df1, df2], ignore_index=True)
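When the frames being stacked do not share all their columns, pd.concat keeps the union of columns and fills the gaps with NaN; join='inner' keeps only the shared ones. A short sketch with two small frames made up for the purpose:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1'], 'D': ['D0', 'D1']})
right = pd.DataFrame({'A': ['A4', 'A5'], 'E': ['E4', 'E5']})

# Default join='outer': union of columns, missing cells become NaN
stacked = pd.concat([left, right], ignore_index=True)

# join='inner': only columns present in every frame
shared = pd.concat([left, right], ignore_index=True, join='inner')

print(stacked.shape)               # (4, 3)
print(int(stacked['E'].isna().sum()))  # 2
print(list(shared.columns))        # ['A']
```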

The **join** method is designed for combining DataFrames on their indices, which is what it does by default.

In [None]:
df3 = pd.DataFrame({
    "A": ["A1", "A2", "A3", "A4"],
    "F": ["F0", "F1", "F2", "F3"],},
    index=[0, 1, 2, 3],)

In [None]:
df1

In [None]:
df3

In [None]:
df1.join(df3, how='inner', lsuffix='_first', rsuffix='_third')
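A compact sketch of what an index-based inner join produces, using two hypothetical frames whose indices only partly overlap. The suffixes are needed because both frames have a column named A; without them pandas raises an error about overlapping columns.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[0, 1, 2])
right = pd.DataFrame({'A': ['X1', 'X2', 'X3'],
                      'F': ['F1', 'F2', 'F3']}, index=[1, 2, 3])

# Rows are matched on index labels; 'inner' keeps labels found in both
joined = left.join(right, how='inner', lsuffix='_left', rsuffix='_right')

print(list(joined.columns))  # ['A_left', 'A_right', 'F']
print(list(joined.index))    # [1, 2]
```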

The **merge** method is more versatile: it lets us specify columns other than the index to join on in both DataFrames.

In [None]:
df1.merge(df3, how='left', on=['A'])
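The how argument decides which keys survive a merge. The sketch below, on two invented frames, contrasts a left and an inner merge on column A and uses indicator=True to show where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
right = pd.DataFrame({'A': ['A1', 'A2', 'A3'], 'F': ['F0', 'F1', 'F2']})

# Left merge: every row of left is kept; unmatched keys get NaN in F
left_merged = left.merge(right, how='left', on='A', indicator=True)

# Inner merge: only keys present in both frames
inner_merged = left.merge(right, how='inner', on='A')

print(left_merged['_merge'].tolist())  # ['left_only', 'both', 'both']
print(inner_merged['A'].tolist())      # ['A1', 'A2']
```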

### Grouping

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

In [None]:
data.head(3)

Find the _Category_ with the highest average rating among its applications. No loops, I promise.

In [None]:
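# .index[0] is the top-ranked category; .index[1] selects the second one.
# Presumably this skips a malformed top entry in this dataset; verify on your copy.
data.groupby('Category')['Rating'].mean().sort_values(ascending=False).index[1]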
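The split, apply and combine steps are easier to follow on a handful of rows. In this sketch the categories and ratings are invented, and idxmax is shown as a shortcut for taking the first label of the sorted result:

```python
import pandas as pd

apps = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'TOOLS', 'TOOLS', 'EDUCATION'],
    'Rating': [4.0, 4.4, 3.5, 4.1, 4.6],
})

# Split by Category, apply mean to Rating, combine into one Series
mean_by_category = apps.groupby('Category')['Rating'].mean()

best = mean_by_category.idxmax()                         # label of the largest mean
ranked = mean_by_category.sort_values(ascending=False)   # full ranking

print(best)                   # EDUCATION
print(ranked.index.tolist())  # ['EDUCATION', 'GAME', 'TOOLS']
```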