## Series

Series is the first data type we will go through in this tutorial. A Series is very similar to a NumPy array, except that it can have index labels, so it can be indexed by label instead of only by numeric position. It also does not need to hold numeric data; it can hold any arbitrary Python object.

In [1]:
import numpy as np
import pandas as pd

In [2]:
labels = ['a','b','c']
my_list = [10,20,30]
my_arr = np.array([10,20,30])
my_d = {'a':10,'b':20,'c':30}

### Creating series:

- Using lists
- Using NumPy arrays
- Using dictionaries

#### Using Lists

In [3]:
pd.Series(my_list)

0    10
1    20
2    30
dtype: int64

In [4]:
# adding index labels

pd.Series(my_list, index=labels)

a    10
b    20
c    30
dtype: int64

#### Using NumPy Arrays

In [5]:
pd.Series(my_arr, labels)

a    10
b    20
c    30
dtype: int64

#### Using Dictionaries

In [6]:
pd.Series(my_d)

a    10
b    20
c    30
dtype: int64

### Data other than numbers

A pandas Series can hold a variety of object types:

In [7]:
pd.Series(labels)

0    a
1    b
2    c
dtype: object

In [8]:
# a series can even hold functions

pd.Series([sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

### Using Index

The key to using a Series is understanding its index. Pandas uses these index labels or positions to allow fast lookups of information.

In [9]:
ser1 = pd.Series([1,2,3,4], index = ['USA', 'Germany','USSR', 'Japan'])
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [10]:
ser2 = pd.Series([1,2,5,4], index = ['USA', 'Germany','Italy', 'Japan'])
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [16]:
display(ser1['USSR'], ser2['Italy'])

3

5
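As a minimal sketch of the different ways to look a value up, the example below rebuilds `ser1` and contrasts label-based access, position-based access, and two ways of handling a label that may be absent. The variable names are chosen for illustration only.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

by_label = ser1.loc['USSR']        # label-based lookup
by_position = ser1.iloc[2]         # position-based lookup of the same element
has_italy = 'Italy' in ser1        # membership tests check the index labels
fallback = ser1.get('Italy', 0)    # .get returns a default instead of raising KeyError
```

Using `.loc` and `.iloc` explicitly makes it clear whether a label or a position is meant, which matters most when the index itself consists of integers.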

In [11]:
ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
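The NaN values appear because arithmetic between two Series aligns on index labels: a label present in only one Series has no partner, so the result is NaN, and the integers are converted to floats to make room for it. The sketch below assumes you would rather treat a missing label as 0, and uses `Series.add` with `fill_value` for that.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# The + operator aligns on labels; labels found in only one Series become NaN.
total = ser1 + ser2

# Series.add with fill_value substitutes 0 for a label missing on one side.
total_filled = ser1.add(ser2, fill_value=0)
```

Whether 0 is a sensible substitute depends on the data; for a sum it often is, for a ratio it usually is not.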

## DataFrames

We can think of a DataFrame as a bunch of Series objects put together to share the same index.
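To make that picture concrete, here is a small sketch that builds a DataFrame from two Series sharing an index and then pulls one column back out. The names `w`, `x`, and `frame` are hypothetical and used only for this illustration.

```python
import pandas as pd

w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# Each dictionary key becomes a column name; the Series supply the shared index.
frame = pd.DataFrame({'W': w, 'X': x})

# Selecting a single column returns a Series carrying that same index.
col = frame['W']
```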

In [20]:
import numpy as np
from numpy.random import randn
np.random.seed(101)

df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


### Selection and Indexing

In [15]:
# select based on one column

df['W']

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

In [16]:
# select multiple columns

df[['W','Z']]

| | W | Z |
|---|---|---|
| A | 2.70685 | 0.503826 |
| B | 0.651118 | 0.605965 |
| C | -2.018168 | -0.589001 |
| D | 0.188695 | 0.955057 |
| E | 0.190794 | 0.683509 |


In [17]:
# attribute (dot) access: equivalent to df['W'] when the column name is a valid Python identifier

df.W

A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
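Dot access is convenient but has limits. The sketch below uses a hypothetical frame `df_demo` to show two cases where bracket notation is the safer choice: a column name containing a space, and a column name that collides with an existing DataFrame method.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'new col': [3, 4], 'count': [5, 6]})

same = df_demo.W.equals(df_demo['W'])    # identical result for a simple name
spaced = df_demo['new col']              # a name with a space needs brackets
counts = df_demo['count']                # brackets return the column ...
is_method = callable(df_demo.count)      # ... while df_demo.count is the DataFrame.count method
```

For this reason bracket notation is a reasonable default, with dot access kept for quick interactive exploration.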

### Creating a new column

In [19]:
df['new'] = df['W'] + df['Z']
df['new']

A    3.210676
B    1.257083
C   -2.607169
D    1.143752
E    0.874303
Name: new, dtype: float64

### Removing a column

In [21]:
# inplace=True modifies df directly instead of returning a new DataFrame

df.drop('new', axis=1, inplace=True)

In [22]:
# drop can also remove rows (axis=0); without inplace=True, df itself is unchanged

df.drop('E', axis=0)

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
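Because `drop` returns a new DataFrame by default, the original is only changed if you pass `inplace=True` or reassign the result. A minimal sketch on a hypothetical frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

without_c = df_demo.drop('C', axis=0)    # returns a new DataFrame without row C
still_there = 'C' in df_demo.index       # the original still contains row C

# Reassigning the result is an alternative to inplace=True.
df_demo = df_demo.drop(columns='X')
```

The `columns=` keyword (and its counterpart `index=`) is an alternative to `axis=1` and `axis=0` that some readers find easier to follow.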


### Selecting Rows

In [23]:
# two methods: by index label (loc) or by integer position (iloc)

display(df.loc['A'], df.iloc[0])

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64
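One difference between the two accessors is easy to miss when slicing: a label slice with `loc` includes its end label, while a position slice with `iloc` excludes its end position, like ordinary Python slicing. A short sketch on a hypothetical frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2, 3, 4]}, index=['A', 'B', 'C', 'D'])

by_label = df_demo.loc['A':'C']    # label slice: includes the end label 'C'
by_position = df_demo.iloc[0:2]    # position slice: excludes position 2
```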

### Selecting subset of rows and columns

In [24]:
df.loc['B','Y']

-0.8480769834036315

In [25]:
df.loc[['A','B'],['W','Y']]

| | W | Y |
|---|---|---|
| A | 2.70685 | 0.907969 |
| B | 0.651118 | -0.848077 |


### Conditional Selection

In [26]:
df>0

| | W | X | Y | Z |
|---|---|---|---|---|
| A | True | True | True | True |
| B | True | False | False | True |
| C | False | True | True | False |
| D | True | False | False | True |
| E | True | True | True | True |


In [27]:
df[df>0]

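| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | NaN | NaN | 0.605965 |
| C | NaN | 0.740122 | 0.528813 | NaN |
| D | 0.188695 | NaN | NaN | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |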


In [28]:
df[df['W']>0]

| | W | X | Y | Z |
|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [30]:
df[df['W']>0][['W','X']]

| | W | X |
|---|---|---|
| A | 2.70685 | 0.628133 |
| B | 0.651118 | -0.319318 |
| D | 0.188695 | -0.758872 |
| E | 0.190794 | 1.978757 |


In [32]:
df[(df['W']>0) & (df['Y']>1)]

| | W | X | Y | Z |
|---|---|---|---|---|
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
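Combining conditions requires the element-wise operators `&` (and), `|` (or), and `~` (not), with each condition wrapped in parentheses. Python's own `and`/`or` try to reduce a whole Series to a single boolean and fail. The sketch below illustrates this on a small hypothetical frame `df_demo`.

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [2.0, -1.0, 0.5], 'Y': [0.3, 2.0, 1.5]},
                       index=['A', 'B', 'C'])

either = df_demo[(df_demo['W'] > 0) | (df_demo['Y'] > 1)]   # element-wise OR
negated = df_demo[~(df_demo['W'] > 0)]                      # element-wise NOT
subset = df_demo.loc[df_demo['W'] > 0, ['W']]               # filter rows and pick columns in one step

try:
    df_demo[(df_demo['W'] > 0) and (df_demo['Y'] > 1)]
    raised = False
except ValueError:
    # Python's `and` cannot evaluate a whole Series as one boolean.
    raised = True
```

Passing the mask and the column list to `.loc` together does the same job as chaining two bracket selections, in a single indexing step.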


### Creating new columns

In [22]:
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df

| | W | X | Y | Z | States |
|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | CA |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |


In [23]:
df['New States'] = df['States'].map({'CA':0, 'NY':1, 'WY':2, 'OR':3, 'CO':4})
df

| | W | X | Y | Z | States | New States |
|---|---|---|---|---|---|---|
| A | 2.70685 | 0.628133 | 0.907969 | 0.503826 | CA | 0 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY | 1 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY | 2 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR | 3 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO | 4 |
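When `map` is given a dictionary, any value without a matching key becomes NaN rather than raising an error, so an incomplete mapping can silently introduce missing data. A minimal sketch with a hypothetical Series `states`:

```python
import pandas as pd

states = pd.Series(['CA', 'NY', 'TX'])

# 'TX' has no key in the dictionary, so its mapped value is NaN.
codes = states.map({'CA': 0, 'NY': 1})

unmapped = codes.isna().sum()   # number of values that found no match
```

Checking `isna()` on the result right after mapping is a cheap way to catch labels the dictionary does not cover.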


### More Index Details

In [33]:
df.reset_index()

| | index | W | X | Y | Z |
|---|---|---|---|---|---|
| 0 | A | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| 1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| 2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| 3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| 4 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |


In [36]:
df.set_index('States')

| States | W | X | Y | Z |
|---|---|---|---|---|
| CA | 2.70685 | 0.628133 | 0.907969 | 0.503826 |
| NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
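Both `reset_index` and `set_index` return new DataFrames and leave the original untouched. Note also that `set_index` discards the previous index labels unless they were first moved into a column. A short sketch on a hypothetical frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'W': [1, 2], 'States': ['CA', 'NY']}, index=['A', 'B'])

by_state = df_demo.set_index('States')              # new frame; the old labels A, B are discarded
kept = df_demo.reset_index().set_index('States')    # old labels survive as an 'index' column
unchanged = list(df_demo.index)                     # the original frame is untouched
```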


### DataFrame Summaries

In [37]:
df.describe()

| | W | X | Y | Z |
|---|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 | 5.0 |
| mean | 0.343858 | 0.453764 | 0.452287 | 0.431871 |
| std | 1.681131 | 1.061385 | 1.454516 | 0.594708 |
| min | -2.018168 | -0.758872 | -0.933237 | -0.589001 |
| 25% | 0.188695 | -0.319318 | -0.848077 | 0.503826 |
| 50% | 0.190794 | 0.628133 | 0.528813 | 0.605965 |
| 75% | 0.651118 | 0.740122 | 0.907969 | 0.683509 |
| max | 2.70685 | 1.978757 | 2.605967 | 0.955057 |


In [38]:
df.dtypes

W         float64
X         float64
Y         float64
Z         float64
States     object
dtype: object

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, A to E
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   W       5 non-null      float64
 1   X       5 non-null      float64
 2   Y       5 non-null      float64
 3   Z       5 non-null      float64
 4   States  5 non-null      object 
dtypes: float64(4), object(1)
memory usage: 412.0+ bytes


## Missing Data

In [40]:
df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})
df

| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |


In [41]:
df.dropna()

| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |


In [42]:
df.dropna(axis=1)

| | C |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |


In [43]:
# keep only the columns that have at least 2 non-NaN values
df.dropna(thresh=2, axis=1)

| | A | C |
|---|---|---|
| 0 | 1.0 | 1 |
| 1 | 2.0 | 2 |
| 2 | NaN | 3 |


In [44]:
df.fillna('YES!')

| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | YES! | 2 |
| 2 | YES! | YES! | 3 |


In [45]:
df['A'].fillna(df.A.mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
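Before dropping or filling, it often helps to count how much is missing. The sketch below rebuilds the same frame under the hypothetical name `df_demo`, counts the NaN values per column, and then fills every column with its own mean in one call by passing the Series returned by `mean()` to `fillna`.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2, np.nan],
                        'B': [5, np.nan, np.nan],
                        'C': [1, 2, 3]})

missing_per_column = df_demo.isna().sum()   # number of NaN values in each column

# fillna accepts a Series keyed by column name, so each column gets its own mean.
filled = df_demo.fillna(df_demo.mean())
```

Like `dropna`, `fillna` returns a new object here; `df_demo` keeps its NaN values unless the result is assigned back.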

## Groupby

The groupby method lets you group rows that share a value in a column and then apply an aggregate function to each group.

In [4]:
import pandas as pd

data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
       'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
       'Sales':[200,120,340,124,243,350]}

df = pd.DataFrame(data)
df

| | Company | Person | Sales |
|---|---|---|---|
| 0 | GOOG | Sam | 200 |
| 1 | GOOG | Charlie | 120 |
| 2 | MSFT | Amy | 340 |
| 3 | MSFT | Vanessa | 124 |
| 4 | FB | Carl | 243 |
| 5 | FB | Sarah | 350 |


In [6]:
# Now we can use the .groupby() method to group rows together based off of a column name
# This will create a DataFrameGroupBy object:

df.groupby('Company')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe7bc361cd0>

In [7]:
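# we can then call aggregate methods off the object
# (numeric_only=True skips the non-numeric Person column, which pandas 2.0+ would otherwise reject):

df.groupby('Company').mean(numeric_only=True)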

| Company | Sales |
|---|---|
| FB | 296.5 |
| GOOG | 160.0 |
| MSFT | 232.0 |


In [None]:
# more aggregate methods

df.groupby('Company').std(numeric_only=True)
df.groupby('Company').min()
df.groupby('Company').count()
df.groupby('Company').describe()
df.groupby('Company').describe().transpose()
df.groupby('Company').describe().transpose()['GOOG']
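When only one column and a few statistics are of interest, selecting the column before aggregating keeps the result small and avoids problems with non-numeric columns. The sketch below rebuilds the sales data as the hypothetical `df_demo`, computes two statistics per company with `agg`, and then uses `idxmax` to fetch the full row of each company's top seller.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350],
})

# Several aggregates at once for a single column.
summary = df_demo.groupby('Company')['Sales'].agg(['mean', 'max'])

# idxmax gives the row label of the largest Sales value in each group.
top_rows = df_demo.loc[df_demo.groupby('Company')['Sales'].idxmax()]
```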

## Operations

In [24]:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | 1 | 444 | abc |
| 1 | 2 | 555 | def |
| 2 | 3 | 666 | ghi |
| 3 | 4 | 444 | xyz |


### Info on Unique Values

In [9]:
df['col2'].unique()

array([444, 555, 666])

In [10]:
df['col2'].nunique()

3

In [11]:
df['col2'].value_counts()

444    2
666    1
555    1
Name: col2, dtype: int64

In [17]:
df.col2.idxmax()

2
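Two details of these methods are worth a quick sketch: `value_counts` can report fractions instead of raw counts via `normalize=True`, and `idxmin`/`idxmax` return the index label of the first occurrence when the extreme value is tied. The Series `col2` below is a standalone copy of the column used for illustration.

```python
import pandas as pd

col2 = pd.Series([444, 555, 666, 444])

shares = col2.value_counts(normalize=True)   # fraction of rows holding each value
first_min = col2.idxmin()                    # 444 appears at labels 0 and 3; the first is returned
```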

### Applying Functions

In [12]:
def times2(x):
    return x*2

In [13]:
df['col1'].apply(times2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

In [15]:
df['col3'].apply(len)

0    3
1    3
2    3
3    3
Name: col3, dtype: int64

In [26]:
df['new col3'] = df['col3'].apply(lambda x:x[0])
df

| | col1 | col2 | col3 | new col3 |
|---|---|---|---|---|
| 0 | 1 | 444 | abc | a |
| 1 | 2 | 555 | def | d |
| 2 | 3 | 666 | ghi | g |
| 3 | 4 | 444 | xyz | x |
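`apply` works with any Python function, but for simple cases like the ones above pandas also offers vectorized equivalents that avoid calling a function once per element. The sketch below reproduces the three results on a hypothetical frame `df_demo` using arithmetic on the column and the `.str` accessor.

```python
import pandas as pd

df_demo = pd.DataFrame({'col1': [1, 2, 3, 4],
                        'col3': ['abc', 'def', 'ghi', 'xyz']})

doubled = df_demo['col1'] * 2              # same result as apply(times2)
lengths = df_demo['col3'].str.len()        # same result as apply(len)
first_letters = df_demo['col3'].str[0]     # same result as apply(lambda x: x[0])
```

`apply` remains the right tool when the logic has no built-in vectorized counterpart.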
