# *Intro to Pandas*

## Pandas Facts:
* Pandas is a Python package used for managing data.
* Pandas provides two new data types for storing data: Series and DataFrame.
* A Pandas dataframe is like an Excel spreadsheet. One column can hold the customer name, another the name of the product sold, and another the price or quantity. In this example, each row would be an individual sale.
* A dataframe in Pandas is made up of one or more series.  
* Each column of a dataframe is a series.
* Each column and row in a Pandas dataframe can be given a label.
* A pandas dataframe is very similar to a data.frame in R.
* Like a NumPy array, a Pandas dataframe is a more robust data type for storing data than a list of lists. Dataframes are also more flexible than NumPy arrays.
* All entries of a NumPy array share the same data type. In a dataframe, each column can have its own data type.
* Pandas also has SQL-like functions for merging, joining, and sorting dataframes.
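
To make the points above concrete, here is a minimal sketch using a small, made-up sales table (the names `sales` and `prices` and all values are invented for illustration). It shows that each column is a Series with its own dtype, and that `merge` and `sort_values` behave much like a SQL join and ORDER BY.

```python
import pandas as pd

# Hypothetical sales data, invented for this example.
sales = pd.DataFrame({
    "customer": ["Ann", "Bob", "Cid"],
    "product": ["pen", "ink", "pen"],
    "quantity": [3, 1, 2],
})
prices = pd.DataFrame({"product": ["pen", "ink"], "price": [1.5, 4.0]})

# Each column is a Series, and each column has its own dtype.
print(type(sales["quantity"]))
print(sales.dtypes)

# SQL-like join on the shared "product" column, then sort by a computed total.
merged = sales.merge(prices, on="product", how="left")
merged["total"] = merged["quantity"] * merged["price"]
print(merged.sort_values("total", ascending=False))
```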

In [1]:
# In general it's good practice to import all packages at the beginning of your code

import pandas as pd
import numpy as np  

# Pandas - Series

In [2]:
# A series in Pandas is the equivalent of a single column of a spreadsheet.
# An entry in a series corresponds to an individual row in the spreadsheet
# A series can be created by converting a list or numpy array to a series.

my_list = [5.4, 6.1, 1.7, 99.8]
my_array = np.array(my_list)

In [3]:
my_series1 = pd.Series(data = my_list)
print(my_series1)

my_series2 = pd.Series(data = my_array)
print(my_series2)

0     5.4
1     6.1
2     1.7
3    99.8
dtype: float64
0     5.4
1     6.1
2     1.7
3    99.8
dtype: float64


In [4]:
# With the default integer index, individual entries are accessed the same way as with lists and arrays:

print(my_series1[2])

1.7


In [5]:
# Labels can be added to the entries of a series

my_labels = ['first', 'second', 'third', 'fourth']
my_series3 = pd.Series(data = my_list, index = my_labels)
print(my_series3)

first      5.4
second     6.1
third      1.7
fourth    99.8
dtype: float64


In [6]:
# The data and index arguments can also be passed by position, without keywords:

my_series4 = pd.Series(my_list, my_labels)
print(my_series4)

first      5.4
second     6.1
third      1.7
fourth    99.8
dtype: float64


In [7]:
# We can also access entries using the index labels:

print(my_series4['second'])

6.1
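
Plain square brackets work here, but when the index holds labels it can be clearer to state explicitly whether a label or a position is meant. The sketch below uses the `.loc` (label) and `.iloc` (position) accessors; the series `s` simply repeats the values used above.

```python
import pandas as pd

s = pd.Series([5.4, 6.1, 1.7, 99.8], index=["first", "second", "third", "fourth"])

by_label = s.loc["second"]    # look up by index label
by_position = s.iloc[1]       # look up by integer position (0-based)
print(by_label, by_position)

# Slicing by label includes the end point; slicing by position does not.
print(s.loc["first":"third"])
print(s.iloc[0:2])
```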


In [8]:
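# We can do math on series. Entries are matched by index label,
# and a label that appears in only one of the series gives NaN: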

my_series5 = pd.Series([5.5,1.1,8.8,1.6],['first', 'third', 'fourth', 'fifth'])
print(my_series5)
print('')
print(my_series5 + my_series4)

first     5.5
third     1.1
fourth    8.8
fifth     1.6
dtype: float64

fifth       NaN
first      10.9
fourth    108.6
second      NaN
third       2.8
dtype: float64
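
If labels missing from one series should count as zero rather than produce NaN, one option is the `add` method with `fill_value`. This is a sketch with the same values as above, under the shorter names `a` and `b`.

```python
import pandas as pd

a = pd.Series([5.4, 6.1, 1.7, 99.8], index=["first", "second", "third", "fourth"])
b = pd.Series([5.5, 1.1, 8.8, 1.6], index=["first", "third", "fourth", "fifth"])

# The + operator leaves NaN wherever a label is missing from either series.
plain_sum = a + b

# add() with fill_value=0 treats a missing entry as 0 before adding.
filled_sum = a.add(b, fill_value=0)
print(filled_sum)
```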


In [9]:
# We can combine series to create a dataframe using the concat function:

df1 = pd.concat([my_series4, my_series5], axis = 1, sort = False)
df1

Unnamed: 0,0,1
first,5.4,5.5
second,6.1,
third,1.7,1.1
fourth,99.8,8.8
fifth,,1.6
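
The columns above are labeled 0 and 1 because the series have no names. One way to get descriptive column labels is the `keys` argument of `pd.concat`. The labels 'store_a' and 'store_b' below are invented for illustration.

```python
import pandas as pd

a = pd.Series([5.4, 6.1, 1.7, 99.8], index=["first", "second", "third", "fourth"])
b = pd.Series([5.5, 1.1, 8.8, 1.6], index=["first", "third", "fourth", "fifth"])

# keys supplies the column labels when concatenating along axis=1.
df = pd.concat([a, b], axis=1, keys=["store_a", "store_b"])
print(df)
```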


In [10]:
# We can create a new dataframe:

df2 = pd.DataFrame(np.random.randn(5, 5))
df2

Unnamed: 0,0,1,2,3,4
0,-0.524415,0.687942,-1.136954,-0.747411,-0.471651
1,1.079324,0.084823,0.502309,0.949654,-1.550163
2,0.86988,0.536724,0.131013,0.785262,-0.990282
3,1.022086,0.684115,0.270509,-0.073757,-0.497938
4,-0.950188,-0.414689,1.399113,0.226544,-0.36202
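
Because the entries are random, the numbers differ on every run. If reproducible output is wanted, one option is a seeded NumPy generator. The seed value 0 below is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # seeded generator; the seed is arbitrary
df = pd.DataFrame(rng.standard_normal((5, 5)))

# Re-creating the generator with the same seed reproduces the same frame.
df_again = pd.DataFrame(np.random.default_rng(0).standard_normal((5, 5)))
print(df.equals(df_again))
```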


In [11]:
# Let's give labels to the rows and columns:
df3 = pd.DataFrame(np.random.randn(5,5), index = ['first row', 'second row', 'third row', 'fourth row', 'fifth row'],
                   columns = ['first col', 'second col', 'third col', 'fourth col', 'fifth col'])
df3

Unnamed: 0,first col,second col,third col,fourth col,fifth col
first row,0.357671,0.504707,-0.713978,-0.633985,0.090022
second row,0.057252,1.398727,0.186526,-1.191709,-0.811977
third row,0.687815,-0.565296,0.686173,-0.355672,1.494404
fourth row,-1.407497,0.097052,0.591261,-0.575985,0.219891
fifth row,0.244088,-0.17052,-0.876439,-0.806833,0.046434


In [12]:
# We can access individual columns of a dataframe.
# A single label returns a series; a list of labels returns a dataframe:

print(df3['second col'])
print('')
df3[['third col', 'first col']]

first row     0.504707
second row    1.398727
third row    -0.565296
fourth row    0.097052
fifth row    -0.170520
Name: second col, dtype: float64



Unnamed: 0,third col,first col
first row,-0.713978,0.357671
second row,0.186526,0.057252
third row,0.686173,0.687815
fourth row,0.591261,-1.407497
fifth row,-0.876439,0.244088


In [13]:
# We can access rows of a dataframe by label with .loc:

df3.loc['fourth row']

first col    -1.407497
second col    0.097052
third col     0.591261
fourth col   -0.575985
fifth col     0.219891
Name: fourth row, dtype: float64

In [14]:
# .iloc selects by integer position (0-based), so this is the third row:
df3.iloc[2]

first col     0.687815
second col   -0.565296
third col     0.686173
fourth col   -0.355672
fifth col     1.494404
Name: third row, dtype: float64

In [15]:
# .loc also accepts lists of row labels and column labels:
df3.loc[['fourth row', 'first row'], ['second col', 'third col']]

Unnamed: 0,second col,third col
fourth row,0.097052,0.591261
first row,0.504707,-0.713978
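
`.loc` selects by label and `.iloc` by integer position. As a sketch, the two selections below pick the same cells from a small frame with made-up values (`df` and its labels are stand-ins for `df3`).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["r0", "r1", "r2"],
    columns=["c0", "c1", "c2"],
)

by_label = df.loc[["r2", "r0"], ["c1", "c2"]]   # rows and columns by label
by_position = df.iloc[[2, 0], [1, 2]]           # same cells by position
print(by_label)
print(by_label.equals(by_position))
```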


In [16]:
# We can use logical indexing for dataframes just like for numpy arrays:

df3 > 0

Unnamed: 0,first col,second col,third col,fourth col,fifth col
first row,True,True,False,False,True
second row,True,True,True,False,False
third row,True,False,True,False,True
fourth row,False,True,True,False,True
fifth row,True,False,False,False,True


In [17]:
# Indexing with the boolean mask keeps matching entries and replaces the rest with NaN:
print(df3[df3 > 0])

            first col  second col  third col  fourth col  fifth col
first row    0.357671    0.504707        NaN         NaN   0.090022
second row   0.057252    1.398727   0.186526         NaN        NaN
third row    0.687815         NaN   0.686173         NaN   1.494404
fourth row        NaN    0.097052   0.591261         NaN   0.219891
fifth row    0.244088         NaN        NaN         NaN   0.046434
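
Masking with a whole-frame condition keeps the shape and fills non-matching cells with NaN. A common variation is to build the mask from a single column, which keeps or drops entire rows. This is a sketch with invented values.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, -2.0, 3.0], "b": [-1.0, 5.0, 6.0]},
    index=["r0", "r1", "r2"],
)

# A boolean Series selects the rows where the condition holds.
positive_a = df[df["a"] > 0]

# Conditions combine with & (and) and | (or); the parentheses are required.
both_positive = df[(df["a"] > 0) & (df["b"] > 0)]
print(positive_a)
print(both_positive)
```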


In [18]:
# We can add columns to a dataframe:

# A 1-D array with one value per row is enough for a single column
df3['sixth col'] = np.random.randn(5)
df3

Unnamed: 0,first col,second col,third col,fourth col,fifth col,sixth col
first row,0.357671,0.504707,-0.713978,-0.633985,0.090022,-0.328494
second row,0.057252,1.398727,0.186526,-1.191709,-0.811977,-1.288178
third row,0.687815,-0.565296,0.686173,-0.355672,1.494404,1.151327
fourth row,-1.407497,0.097052,0.591261,-0.575985,0.219891,0.152074
fifth row,0.244088,-0.17052,-0.876439,-0.806833,0.046434,-1.645416
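
A new column can also be computed from existing ones rather than filled with fresh values. This is a sketch with an invented frame; `assign` returns a new frame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"price": [1.5, 4.0], "quantity": [3, 2]})

# Column arithmetic is element-wise, row by row.
df["total"] = df["price"] * df["quantity"]

# assign() returns a copy with the extra column; df itself is not modified.
with_tax = df.assign(total_with_tax=df["total"] * 1.1)
print(with_tax)
```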


In [19]:
# We can remove columns or rows from a dataframe.
# axis = 1 drops a column; inplace = True modifies df3 itself:

df3.drop('first col', axis = 1, inplace = True)

In [20]:
df3

Unnamed: 0,second col,third col,fourth col,fifth col,sixth col
first row,0.504707,-0.713978,-0.633985,0.090022,-0.328494
second row,1.398727,0.186526,-1.191709,-0.811977,-1.288178
third row,-0.565296,0.686173,-0.355672,1.494404,1.151327
fourth row,0.097052,0.591261,-0.575985,0.219891,0.152074
fifth row,-0.17052,-0.876439,-0.806833,0.046434,-1.645416


In [21]:
# 'first col' was already dropped in place above, so dropping it again raises a KeyError:
df4 = df3.drop('first col', axis = 1)
df4

KeyError: "['first col'] not found in axis"
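
When a label may already have been removed, as with 'first col' here, `drop` accepts `errors='ignore'`, which skips missing labels instead of raising. This is a sketch with an invented frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

df.drop("a", axis=1, inplace=True)      # modifies df; column "a" is gone

# A second plain drop of "a" would raise KeyError; errors="ignore" skips it.
unchanged = df.drop("a", axis=1, errors="ignore")
print(list(unchanged.columns))
```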

In [22]:
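# axis = 0 drops a row. Without inplace = True, drop returns a new dataframe and leaves df3 unchanged:
df5 = df3.drop('second row', axis = 0)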
df5

Unnamed: 0,second col,third col,fourth col,fifth col,sixth col
first row,0.504707,-0.713978,-0.633985,0.090022,-0.328494
third row,-0.565296,0.686173,-0.355672,1.494404,1.151327
fourth row,0.097052,0.591261,-0.575985,0.219891,0.152074
fifth row,-0.17052,-0.876439,-0.806833,0.046434,-1.645416


In [23]:
# We can move a dataframe's index labels into a regular column and restore the default integer index:

df5.reset_index()

Unnamed: 0,index,second col,third col,fourth col,fifth col,sixth col
0,first row,0.504707,-0.713978,-0.633985,0.090022,-0.328494
1,third row,-0.565296,0.686173,-0.355672,1.494404,1.151327
2,fourth row,0.097052,0.591261,-0.575985,0.219891,0.152074
3,fifth row,-0.17052,-0.876439,-0.806833,0.046434,-1.645416


In [24]:
# reset_index returned a new dataframe, so df5 itself is unchanged:
df5

Unnamed: 0,second col,third col,fourth col,fifth col,sixth col
first row,0.504707,-0.713978,-0.633985,0.090022,-0.328494
third row,-0.565296,0.686173,-0.355672,1.494404,1.151327
fourth row,0.097052,0.591261,-0.575985,0.219891,0.152074
fifth row,-0.17052,-0.876439,-0.806833,0.046434,-1.645416


In [25]:
# With inplace = True, df5 itself is modified:
df5.reset_index(inplace = True)
df5

Unnamed: 0,index,second col,third col,fourth col,fifth col,sixth col
0,first row,0.504707,-0.713978,-0.633985,0.090022,-0.328494
1,third row,-0.565296,0.686173,-0.355672,1.494404,1.151327
2,fourth row,0.097052,0.591261,-0.575985,0.219891,0.152074
3,fifth row,-0.17052,-0.876439,-0.806833,0.046434,-1.645416
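
By default `reset_index` keeps the old labels as a new column named 'index'. If the labels should be discarded instead, `drop=True` does that. This is a sketch with invented values.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])

kept = df.reset_index()              # old labels move into an "index" column
dropped = df.reset_index(drop=True)  # old labels are discarded
print(kept)
print(dropped)
```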


In [26]:
# We can assign new index labels by adding a column and then setting it as the index:

df5['new name'] = ['This','is','the','row']
df5

Unnamed: 0,index,second col,third col,fourth col,fifth col,sixth col,new name
0,first row,0.504707,-0.713978,-0.633985,0.090022,-0.328494,This
1,third row,-0.565296,0.686173,-0.355672,1.494404,1.151327,is
2,fourth row,0.097052,0.591261,-0.575985,0.219891,0.152074,the
3,fifth row,-0.17052,-0.876439,-0.806833,0.046434,-1.645416,row


In [27]:
df5.set_index('new name',inplace=True)
df5

new name,index,second col,third col,fourth col,fifth col,sixth col
This,first row,0.504707,-0.713978,-0.633985,0.090022,-0.328494
is,third row,-0.565296,0.686173,-0.355672,1.494404,1.151327
the,fourth row,0.097052,0.591261,-0.575985,0.219891,0.152074
row,fifth row,-0.17052,-0.876439,-0.806833,0.046434,-1.645416
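
Adding a column and calling `set_index` is one route. When the goal is only to relabel the rows, assigning to `df.index` or using `rename` is a shorter alternative. This is a sketch with invented labels.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20]}, index=["old_a", "old_b"])

# rename maps individual labels and returns a new frame.
renamed = df.rename(index={"old_a": "new_a"})

# Assigning a list replaces every label at once; the length must match.
df.index = ["row_1", "row_2"]
print(renamed)
print(df)
```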
