# *Intro to Pandas*

## Pandas Facts:
* Pandas is a Python package for working with labeled, tabular data.
* Pandas provides two main data types for storing data: the series and the dataframe.
* A Pandas dataframe is like an Excel spreadsheet. For example, one column can hold customer names, another the names of the products sold, and another the price or quantity. In this example, each row would be an individual sale.
* A dataframe in Pandas is made up of one or more series.
* Each column of a dataframe is a series.
* Each column and row in a Pandas dataframe can be given a label.
* A Pandas dataframe is very similar to a data.frame in R.
* Like a NumPy array, a Pandas dataframe is a more robust way to store data than a list of lists. Dataframes are also more flexible than NumPy arrays.
* A NumPy array stores a matrix whose entries all share one data type. In a dataframe, each column can have its own data type.
* Pandas also has SQL-like functions for merging, joining, and sorting dataframes.
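
To make the spreadsheet analogy concrete, here is a minimal sketch using made-up sales data (the customer names, products, and prices are invented for illustration, as are the variable names `sales`, `prices`, and `merged`). It shows that each column keeps its own data type, and it uses the SQL-like merge and sort mentioned above.

```python
import pandas as pd

# Hypothetical sales data: each row is one sale, each column is one series.
sales = pd.DataFrame({
    'customer': ['Ann', 'Bob', 'Ann'],
    'product': ['pen', 'ink', 'ink'],
    'quantity': [3, 1, 2],
})
prices = pd.DataFrame({
    'product': ['pen', 'ink'],
    'price': [1.5, 4.0],
})

# Each column keeps its own data type (text, integer, float).
print(sales.dtypes)

# SQL-like left join on the shared 'product' column.
merged = sales.merge(prices, on='product', how='left')

# A computed column, then a sort from largest to smallest total.
merged['total'] = merged['quantity'] * merged['price']
print(merged.sort_values('total', ascending=False))
```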

In [1]:
# In general, it's good practice to import all packages at the beginning of your code

import pandas as pd
import numpy as np  

# Pandas - Series

In [2]:
# A series in Pandas is the equivalent of a single column of a spreadsheet.
# An entry in a series corresponds to an individual row in the spreadsheet
# A series can be created by converting a list or numpy array to a series.

my_list = [5.4, 6.1, 1.7, 99.8]
my_array = np.array(my_list)

In [3]:
my_series1 = pd.Series(data = my_list)
print(my_series1)

my_series2 = pd.Series(data = my_array)
print(my_series2)

0     5.4
1     6.1
2     1.7
3    99.8
dtype: float64
0     5.4
1     6.1
2     1.7
3    99.8
dtype: float64


In [4]:
# Individual entries are accessed the same way as with lists and arrays:

print(my_series1[2])

1.7


In [5]:
# Labels can be added to the entries of a series

my_labels = ['first', 'second', 'third', 'fourth']
my_series3 = pd.Series(data = my_list, index = my_labels)
print(my_series3)

first      5.4
second     6.1
third      1.7
fourth    99.8
dtype: float64


In [6]:
# We do not need to name the arguments of pd.Series; they can be passed by position (data first, then index):

my_series4 = pd.Series(my_list, my_labels)
print(my_series4)

first      5.4
second     6.1
third      1.7
fourth    99.8
dtype: float64


In [7]:
# We can also access entries using the index labels:

print(my_series4['second'])

6.1
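
The two access styles shown so far (by position and by label) can also be written explicitly. The following is a small sketch, assuming the same values and labels as above; the variable names `s`, `by_position`, and `by_label` are only for illustration. Being explicit avoids ambiguity once a series has custom labels.

```python
import pandas as pd

s = pd.Series([5.4, 6.1, 1.7, 99.8],
              index=['first', 'second', 'third', 'fourth'])

# .iloc selects by integer position (counting from 0).
by_position = s.iloc[2]

# .loc selects by index label.
by_label = s.loc['third']

print(by_position, by_label)
```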


In [8]:
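# We can do math on series. Values are matched by index label, not by position.
# A label that appears in only one of the two series gives NaN in the result: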

my_series5 = pd.Series([5.5,1.1,8.8,1.6],['first', 'third', 'fourth', 'fifth'])
print(my_series5)
print('')
print(my_series5 + my_series4)

first     5.5
third     1.1
fourth    8.8
fifth     1.6
dtype: float64

fifth       NaN
first      10.9
fourth    108.6
second      NaN
third       2.8
dtype: float64
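
If the NaN entries are not wanted, one option is the `add` method with `fill_value`, which treats a label missing from one series as a given value. This is a minimal sketch reusing the values above; `a`, `b`, and `total` are placeholder names.

```python
import pandas as pd

a = pd.Series([5.4, 6.1, 1.7, 99.8], ['first', 'second', 'third', 'fourth'])
b = pd.Series([5.5, 1.1, 8.8, 1.6], ['first', 'third', 'fourth', 'fifth'])

# a + b gives NaN where a label exists in only one series.
# add(..., fill_value=0) treats the missing side as 0 instead.
total = a.add(b, fill_value=0)
print(total)
```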


In [9]:
# We can combine series to create a dataframe using the concat function:

df1 = pd.concat([my_series4, my_series5], axis = 1, sort = False)
df1

           0    1
first    5.4  5.5
second   6.1  NaN
third    1.7  1.1
fourth  99.8  8.8
fifth    NaN  1.6
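
The columns above are named 0 and 1 by default. As a sketch of one way to name them while concatenating, `pd.concat` accepts a `keys` argument; the names `'a'` and `'b'` below are placeholders chosen for illustration.

```python
import pandas as pd

s4 = pd.Series([5.4, 6.1, 1.7, 99.8], ['first', 'second', 'third', 'fourth'])
s5 = pd.Series([5.5, 1.1, 8.8, 1.6], ['first', 'third', 'fourth', 'fifth'])

# keys= supplies the column names when concatenating along axis 1.
df = pd.concat([s4, s5], axis=1, keys=['a', 'b'])
print(df)
```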


In [10]:
# We can create a new dataframe directly, here from a 5x5 NumPy array of random numbers:

df2 = pd.DataFrame(np.random.randn(5, 5))
df2

          0         1         2         3         4
0 -0.413808 -1.221596  0.969325  0.278361 -0.336226
1 -0.351065 -0.709211 -0.784765  0.561655 -1.080151
2  1.632429  0.031090 -1.416508  1.290680 -2.033203
3  1.083881  0.203596 -0.375386 -0.549303 -0.923695
4  0.164363  3.119962 -1.237944 -0.221878  0.384138
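
Because the numbers are random, rerunning the cell gives a different table each time. If repeatable output is wanted, one possible approach is a seeded NumPy generator. This is a sketch, not what the cell above does; the seed value 0 is arbitrary.

```python
import numpy as np
import pandas as pd

# A seeded generator produces the same "random" numbers on every run.
rng = np.random.default_rng(0)  # the seed value 0 is an arbitrary choice
df = pd.DataFrame(rng.standard_normal((5, 5)))
print(df.shape)
```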


In [11]:
# Let's give labels to the rows and columns:
df3 = pd.DataFrame(np.random.randn(5,5), index = ['first row', 'second row', 'third row', 'fourth row', 'fifth row'],
                   columns = ['first col', 'second col', 'third col', 'fourth col', 'fifth col'])
df3

            first col  second col  third col  fourth col  fifth col
first row   -1.465796    0.024343  -0.846782    0.467769   0.259193
second row   0.298662    0.711735   1.456420    0.042725   1.337166
third row   -0.176285   -1.398933  -0.593307   -0.686660   0.722831
fourth row   0.086684    0.090902  -1.471326   -0.460387   1.626224
fifth row   -0.540244   -0.658691   1.323381    0.264558   0.478201


In [12]:
# We can access individual series (columns) in a dataframe.
# A single column name returns a series; a list of column names returns a dataframe:

print(df3['second col'])
print('')
df3[['third col', 'first col']]

first row     0.024343
second row    0.711735
third row    -1.398933
fourth row    0.090902
fifth row    -0.658691
Name: second col, dtype: float64



            third col  first col
first row   -0.846782  -1.465796
second row   1.456420   0.298662
third row   -0.593307  -0.176285
fourth row  -1.471326   0.086684
fifth row    1.323381  -0.540244


In [13]:
# We can access rows of a dataframe:

df3.loc['fourth row']

first col     0.086684
second col    0.090902
third col    -1.471326
fourth col   -0.460387
fifth col     1.626224
Name: fourth row, dtype: float64

In [14]:
# iloc selects a row by integer position (position 2 is the third row):

df3.iloc[2]

first col    -0.176285
second col   -1.398933
third col    -0.593307
fourth col   -0.686660
fifth col     0.722831
Name: third row, dtype: float64

In [15]:
# loc also accepts a list of row labels and a list of column labels:

df3.loc[['fourth row', 'first row'], ['second col', 'third col']]

            second col  third col
fourth row    0.090902  -1.471326
first row     0.024343  -0.846782
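
Both `loc` and `iloc` also accept slices, with one difference worth knowing: label slices include the end label, while position slices exclude the end position. A small sketch on a made-up 3x3 dataframe (the labels `r1`..`r3` and `c1`..`c3` are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

# Label slices with .loc include both endpoints ('r1' and 'r2').
print(df.loc['r1':'r2', 'c2'])

# Position slices with .iloc exclude the end position, like Python lists.
print(df.iloc[0:2, 1])

# A single cell, selected by row label and column label.
print(df.loc['r3', 'c1'])
```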


In [16]:
# We can use logical indexing for dataframes just like for numpy arrays:

df3 > 0

            first col  second col  third col  fourth col  fifth col
first row       False        True      False        True       True
second row       True        True       True        True       True
third row       False       False      False       False       True
fourth row       True        True      False       False       True
fifth row       False       False       True        True       True


In [17]:
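# Indexing with the boolean dataframe keeps the values where the condition is True
# and puts NaN everywhere else:

print(df3[df3 > 0])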

            first col  second col  third col  fourth col  fifth col
first row         NaN    0.024343        NaN    0.467769   0.259193
second row   0.298662    0.711735   1.456420    0.042725   1.337166
third row         NaN         NaN        NaN         NaN   0.722831
fourth row   0.086684    0.090902        NaN         NaN   1.626224
fifth row         NaN         NaN   1.323381    0.264558   0.478201
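
Logical indexing is often used to select whole rows based on one column. The following is a minimal sketch on made-up data (the column names `x` and `y` and the row labels are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'x': [-1.0, 2.0, 3.0], 'y': [5.0, -6.0, 7.0]},
                  index=['a', 'b', 'c'])

# A boolean series built from one column selects whole rows.
positive_x = df[df['x'] > 0]

# Conditions combine with & (and) and | (or); each needs its own parentheses.
both_positive = df[(df['x'] > 0) & (df['y'] > 0)]

print(positive_x)
print(both_positive)
```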


In [18]:
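# We can add columns to a dataframe by assigning to a new column name.
# The new values need one entry per row (here a 1-D array of length 5):

df3['sixth col'] = np.random.randn(5)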
df3

            first col  second col  third col  fourth col  fifth col  sixth col
first row   -1.465796    0.024343  -0.846782    0.467769   0.259193  -1.615621
second row   0.298662    0.711735   1.456420    0.042725   1.337166  -0.410928
third row   -0.176285   -1.398933  -0.593307   -0.686660   0.722831   0.662315
fourth row   0.086684    0.090902  -1.471326   -0.460387   1.626224   1.501315
fifth row   -0.540244   -0.658691   1.323381    0.264558   0.478201  -1.043483
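
New columns are often computed from existing ones rather than filled with fresh values. A short sketch on made-up data (the column names and the `'USD'` value are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'price': [2.0, 3.5], 'quantity': [4, 2]})

# A new column computed element by element from existing columns.
df['total'] = df['price'] * df['quantity']

# A single value is repeated for every row.
df['currency'] = 'USD'

print(df)
```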


In [23]:
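# We can remove columns or rows from a dataframe.
# drop returns a new dataframe; df3 itself is unchanged unless inplace = True is passed.
# Note: the cell numbers show that these cells were run out of order. When this cell ran,
# 'first col' had already been removed from df3 (see the df3 output below), hence the KeyError.

df3.drop('first col', axis = 1)

KeyError: "['first col'] not found in axis"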

In [20]:
df3

            second col  third col  fourth col  fifth col  sixth col
first row     0.024343  -0.846782    0.467769   0.259193  -1.615621
second row    0.711735   1.456420    0.042725   1.337166  -0.410928
third row    -1.398933  -0.593307   -0.686660   0.722831   0.662315
fourth row    0.090902  -1.471326   -0.460387   1.626224   1.501315
fifth row    -0.658691   1.323381    0.264558   0.478201  -1.043483


In [21]:
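# Assigning the result to a new name keeps the reduced dataframe.
# This call fails here only because 'first col' is no longer a column of df3 (see the df3 output above).

df4 = df3.drop('first col', axis = 1)
df4

KeyError: "['first col'] not found in axis"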

In [24]:
# df4 was never created, because the assignment that defines it raised a KeyError:

df4

NameError: name 'df4' is not defined
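
The errors above come from the state of df3 when the cells were run, not from `drop` itself. The following self-contained sketch, on a small made-up dataframe, shows that `drop` leaves the original untouched and returns a new dataframe. It also shows the `errors='ignore'` option, which skips labels that are not present instead of raising a KeyError.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])

# drop returns a new dataframe; df still has column 'a' afterwards.
without_a = df.drop('a', axis=1)
print(list(df.columns), list(without_a.columns))

# Dropping 'a' again would raise a KeyError; errors='ignore' skips it instead.
still_fine = without_a.drop('a', axis=1, errors='ignore')
print(list(still_fine.columns))
```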

In [None]:
# With axis = 0, drop removes a row instead of a column:

df5 = df3.drop('second row', axis = 0)
df5

In [None]:
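# We can remove a dataframe's index labels. reset_index moves the labels into a regular column
# and replaces them with the default integer index. It returns a new dataframe, so df5 is unchanged here: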

df5.reset_index()

In [None]:
df5

In [None]:
# With inplace = True, df5 itself is modified:

df5.reset_index(inplace = True)
df5
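
By default, `reset_index` keeps the old labels as a new column. If they are not needed at all, `drop=True` discards them. A minimal sketch on a made-up dataframe (the names `kept` and `discarded` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20]}, index=['r1', 'r2'])

# By default the old labels are kept as a new column named 'index'.
kept = df.reset_index()
print(list(kept.columns))

# drop=True discards the old labels entirely.
discarded = df.reset_index(drop=True)
print(list(discarded.columns))
```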

In [None]:
# We can assign new labels to the index. First add them as a column
# (one entry per row; df5 has four rows after the drop above):

df5['new name'] = ['This','is','the','row']
df5

In [None]:
# Then make that column the index:

df5.set_index('new name', inplace = True)
df5
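
Going through a helper column is one way to relabel rows. As a sketch of two alternatives, labels can also be assigned to `.index` directly, or selected labels can be changed with `rename`. The data and label names below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30]})

# Assign a full set of new labels directly (one per row).
df.index = ['a', 'b', 'c']

# Or change selected labels; rename returns a new dataframe.
renamed = df.rename(index={'a': 'alpha'})

print(list(df.index))
print(list(renamed.index))
```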