
# Pandas


pandas is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python.

pandas builds on NumPy, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures pandas provides are Series and DataFrames. After a brief introduction to these two data structures and data ingestion, the key features of pandas this notebook covers are:

- Generating descriptive statistics on data
- Data cleaning using built-in pandas functions
- Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
- Merging multiple datasets using DataFrames
- Working with timestamps and time-series data

# Additional Recommended Resources:

- pandas documentation: http://pandas.pydata.org/pandas-docs/stable/
- Python for Data Analysis by Wes McKinney
- Python Data Science Handbook by Jake VanderPlas

Let's get started with our first pandas notebook!

# Import libraries

In [1]:
import pandas as pd

## Introduction to pandas data structures

pandas has two main data structures: Series and DataFrame.

## Pandas Series

A pandas Series is a one-dimensional labeled array.

In [4]:
ser = pd.Series(data =[100,200,300,400,500], index =("tom","bob","nancy","dan","eric"))

ser

tom      100
bob      200
nancy    300
dan      400
eric     500
dtype: int64

In [5]:
ser.index

Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')

In [6]:
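# Unlike a typical NumPy array, a pandas Series can hold heterogeneous values.
# It is stored under a separate name so that `ser` keeps its integer values.
mixed = pd.Series(data =[100,"foo",300,"bar",500], index =("tom","bob","nancy","dan","eric"))

mixed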

tom      100
bob      foo
nancy    300
dan      bar
eric     500
dtype: object
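
To make the contrast with NumPy concrete, here is a small sketch that is not part of the original notebook. It builds the same mixed values once as a NumPy array and once as a Series; the names `values`, `arr`, and `mixed` are illustrative. NumPy converts everything to one string dtype, while the Series falls back to the generic `object` dtype and keeps each element's Python type.

```python
import numpy as np
import pandas as pd

values = [100, "foo", 300, "bar", 500]

# NumPy picks one common dtype, so the integers are converted to strings.
arr = np.array(values)
print(arr.dtype.kind)   # "U" denotes a Unicode string dtype

# pandas stores the values in an object-dtype Series and keeps each
# element's original Python type.
mixed = pd.Series(values, index=["tom", "bob", "nancy", "dan", "eric"])
print(mixed.dtype)
print(type(mixed["tom"]), type(mixed["bob"]))
```

The flexibility has a cost: arithmetic on an `object` Series is generally slower than on a numeric one, so mixed columns are best kept for data that really is mixed.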

In [9]:
ser["nancy"]

300

In [12]:
ser.loc[["nancy","bob"]]

nancy    300
bob      200
dtype: int64

In [13]:
# Select by integer position; .iloc makes the positional intent explicit
ser.iloc[[4,3,1]]

eric    500
dan     400
bob     200
dtype: int64

In [14]:
"bob" in ser

True

In [15]:
ser.iloc[2]

300
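
The cells above mix two ways of selecting elements. The following sketch, written for illustration and assuming the same integer Series as earlier, puts label-based access (`.loc`) next to position-based access (`.iloc`).

```python
import pandas as pd

ser = pd.Series([100, 200, 300, 400, 500],
                index=["tom", "bob", "nancy", "dan", "eric"])

# .loc selects by index label.
by_label = ser.loc["nancy"]

# .iloc selects by integer position, counting from 0.
by_position = ser.iloc[2]

# Both expressions refer to the same element here.
print(by_label, by_position)

# Lists work with both accessors and return a new Series.
print(ser.loc[["eric", "dan"]])
print(ser.iloc[[4, 3]])
```

Using the explicit accessors avoids ambiguity about whether a key is meant as a label or as a position.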

In [16]:
ser = pd.Series(data =[100,200,300,400,500], index =("tom","bob","nancy","dan","eric"))

ser

tom      100
bob      200
nancy    300
dan      400
eric     500
dtype: int64

In [17]:
ser * 2

tom       200
bob       400
nancy     600
dan       800
eric     1000
dtype: int64

In [18]:
ser**2

tom       10000
bob       40000
nancy     90000
dan      160000
eric     250000
dtype: int64

In [19]:
ser[["nancy","eric"]]**2

nancy     90000
eric     250000
dtype: int64
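
Arithmetic on a Series is applied element by element and the index labels are carried over to the result. As a rough illustration (the variable names are made up for this sketch), the vectorized expression can be compared with an explicit loop:

```python
import pandas as pd

ser = pd.Series([100, 200, 300, 400, 500],
                index=["tom", "bob", "nancy", "dan", "eric"])

# Vectorized form: the operation is applied to every element at once.
doubled = ser * 2

# Equivalent explicit loop, shown only for comparison.
doubled_loop = pd.Series({label: value * 2 for label, value in ser.items()})

# Both approaches produce the same labels and values.
print(doubled.equals(doubled_loop))
```

The vectorized form is shorter and is usually much faster on large data.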

## Pandas DataFrame

A pandas DataFrame is a two-dimensional labeled data structure.

### Create DataFrame from a dictionary of pandas Series

In [35]:
d = {"one" : pd.Series([100., 200., 300.],  index = ["apple", "ball", "clock"]), 
     "two" : pd.Series([111.,222.,333.,444.],  index = ["apple","ball","cerill","dancy"])}

In [36]:
df = pd.DataFrame(d)
df

          one    two
apple   100.0  111.0
ball    200.0  222.0
cerill    NaN  333.0
clock   300.0    NaN
dancy     NaN  444.0
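
The missing values in this output come from index alignment: the DataFrame's row index is the union of the two Series indexes, and a label that is absent from one Series gets NaN in that column. A minimal sketch of this behaviour, using the illustrative names `one`, `two`, and `union`:

```python
import pandas as pd

one = pd.Series([100., 200., 300.], index=["apple", "ball", "clock"])
two = pd.Series([111., 222., 333., 444.],
                index=["apple", "ball", "cerill", "dancy"])

df = pd.DataFrame({"one": one, "two": two})

# The row index is the sorted union of both Series indexes.
union = one.index.union(two.index)
print(list(df.index) == list(union))

# Labels missing from a Series are filled with NaN in its column.
print(df["one"].isna())
```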


In [37]:
df.index

Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [39]:
df.columns

Index(['one', 'two'], dtype='object')

In [41]:
pd.DataFrame(d, index = ["dancy","ball","apple"])

         one    two
dancy    NaN  444.0
ball   200.0  222.0
apple  100.0  111.0


In [44]:
pd.DataFrame(d, index = ["dancy","ball","apple"], columns =["two","five"])

         two  five
dancy  444.0   NaN
ball   222.0   NaN
apple  111.0   NaN


### Create DataFrame from a list of Python dictionaries

In [48]:
data = [{"alex": 1, "joe": 2}, {"ema":5, "dora": 10, "alice": 20}]

In [49]:
pd.DataFrame(data)

   alex  joe  ema  dora  alice
0   1.0  2.0  NaN   NaN    NaN
1   NaN  NaN  5.0  10.0   20.0


In [51]:
pd.DataFrame(data, index = ["orange","red"])

        alex  joe  ema  dora  alice
orange   1.0  2.0  NaN   NaN    NaN
red      NaN  NaN  5.0  10.0   20.0


In [52]:
pd.DataFrame(data, columns =["joe","dora","alice"])

   joe  dora  alice
0  2.0   NaN    NaN
1  NaN  10.0   20.0
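
When a DataFrame is built from a list of dictionaries, each dictionary becomes one row and each distinct key becomes one column. The following sketch, added here for illustration, inspects the result for the same `data` list:

```python
import pandas as pd

data = [{"alex": 1, "joe": 2}, {"ema": 5, "dora": 10, "alice": 20}]
df = pd.DataFrame(data)

# One row per dictionary, one column per distinct key.
print(df.shape)

# Keys absent from a dictionary become NaN, which turns the integer
# inputs into float columns.
print(df.dtypes)
print(df.isna().sum())
```

This explains why the integers in `data` are displayed as `1.0`, `2.0`, and so on: NaN is a floating-point value, so every column that contains it is stored as float.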


### Basic DataFrame Operations

In [53]:
df

          one    two
apple   100.0  111.0
ball    200.0  222.0
cerill    NaN  333.0
clock   300.0    NaN
dancy     NaN  444.0


In [54]:
df["one"]

apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64

In [55]:
df["three"] = df["one"]* df["two"]

In [57]:
df

          one    two    three
apple   100.0  111.0  11100.0
ball    200.0  222.0  44400.0
cerill    NaN  333.0      NaN
clock   300.0    NaN      NaN
dancy     NaN  444.0      NaN


In [58]:
df["flag"] = df["one"] > 250

In [59]:
df

          one    two    three   flag
apple   100.0  111.0  11100.0  False
ball    200.0  222.0  44400.0  False
cerill    NaN  333.0      NaN  False
clock   300.0    NaN      NaN   True
dancy     NaN  444.0      NaN  False
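
Note that the rows where `one` is NaN receive `False` in the `flag` column, because an ordering comparison against NaN is always False. If "unknown" should stay distinct from False, one possible approach, sketched here as an assumption and not taken from the original notebook, is to use the nullable `boolean` dtype and mark those rows as missing:

```python
import numpy as np
import pandas as pd

one = pd.Series([100., 200., np.nan, 300., np.nan],
                index=["apple", "ball", "cerill", "clock", "dancy"])

# Any ordering comparison against NaN evaluates to False.
flag = one > 250
print(flag)

# Convert to the nullable boolean dtype and mark rows with no data as missing.
flag_with_missing = flag.astype("boolean")
flag_with_missing[one.isna()] = pd.NA
print(flag_with_missing)
```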


In [60]:
three = df.pop("three")

In [61]:
three

apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64

In [62]:
df

          one    two   flag
apple   100.0  111.0  False
ball    200.0  222.0  False
cerill    NaN  333.0  False
clock   300.0    NaN   True
dancy     NaN  444.0  False


In [63]:
del df["two"]

In [64]:
df

          one   flag
apple   100.0  False
ball    200.0  False
cerill    NaN  False
clock   300.0   True
dancy     NaN  False
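
The `pop` and `del` calls above both modify the DataFrame in place. For comparison, the sketch below adds `drop`, which is not used in the original notebook and returns a new DataFrame instead. The small frame and its column names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0], "two": [3.0, 4.0], "three": [5.0, 6.0]})

# pop removes the column in place and returns it as a Series.
three = df.pop("three")

# del removes the column in place and returns nothing.
del df["two"]

# drop returns a new DataFrame and leaves the original untouched.
without_one = df.drop(columns=["one"])

print(list(df.columns))
print(list(without_one.columns))
```

`drop` is often the safer choice in longer analyses because the original data stays available.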


In [65]:
df.insert(2,"copy_of_one", df["one"])
df

          one   flag  copy_of_one
apple   100.0  False        100.0
ball    200.0  False        200.0
cerill    NaN  False          NaN
clock   300.0   True        300.0
dancy     NaN  False          NaN


In [67]:
df["one_upper_half"] = df["one"][:2]
df

          one   flag  copy_of_one  one_upper_half
apple   100.0  False        100.0           100.0
ball    200.0  False        200.0           200.0
cerill    NaN  False          NaN             NaN
clock   300.0   True        300.0             NaN
dancy     NaN  False          NaN             NaN
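
The new column has values only for the first two rows because assigning a Series to a DataFrame column matches rows by index label, not by position. A minimal sketch of that alignment, using a smaller illustrative frame and `.iloc[:2]` as the explicit positional form of the slice:

```python
import pandas as pd

df = pd.DataFrame({"one": [100., 200., 300.]},
                  index=["apple", "ball", "clock"])

# Take the first two rows; the result keeps the labels "apple" and "ball".
upper = df["one"].iloc[:2]

# Assignment matches rows by label, so "clock" has no value and gets NaN.
df["one_upper_half"] = upper
print(df)
```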


# Data Ingestion with Pandas

## Case Study: Movie Data Analysis

This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using *pandas*.

## Download the Dataset

Let's look at the files in this dataset using the UNIX command `ls`.

In [82]:
!ls ./movielens

'ls' is not recognized as an internal or external command,
operable program or batch file.
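
The error above appears because `ls` is a UNIX command and this notebook is running on Windows, where the shell does not provide it. One portable alternative, sketched here as an assumption and not the notebook's original approach, is to list the directory from Python with `pathlib`. The helper name `list_files` is hypothetical; `movielens` is the directory name used in the cell above.

```python
from pathlib import Path


def list_files(directory):
    """Return the sorted entry names in a directory, or [] if it is missing."""
    path = Path(directory)
    if not path.is_dir():
        return []
    return sorted(entry.name for entry in path.iterdir())


# Relative paths are resolved against the notebook's working directory.
print(list_files("movielens"))
```

Because the listing is done by Python itself, the same cell behaves the same way on Windows, macOS, and Linux.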


In [75]:
import sys

In [76]:
sys.path

['C:\\Users\\User\\Desktop\\Pythonude\\EDX_data_course',
 'C:\\Users\\User\\Anaconda3\\python37.zip',
 'C:\\Users\\User\\Anaconda3\\DLLs',
 'C:\\Users\\User\\Anaconda3\\lib',
 'C:\\Users\\User\\Anaconda3',
 '',
 'C:\\Users\\User\\Anaconda3\\lib\\site-packages',
 'C:\\Users\\User\\Anaconda3\\lib\\site-packages\\win32',
 'C:\\Users\\User\\Anaconda3\\lib\\site-packages\\win32\\lib',
 'C:\\Users\\User\\Anaconda3\\lib\\site-packages\\Pythonwin',
 'C:\\Users\\User\\Anaconda3\\lib\\site-packages\\IPython\\extensions',
 'C:\\Users\\User\\.ipython']

In [83]:
pwd

'C:\\Users\\User\\Desktop\\Pythonude\\EDX_data_course'

In [84]:
ls

 Volume in drive C has no label.
 Volume Serial Number is 8023-0B54

 Directory of C:\Users\User\Desktop\Pythonude\EDX_data_course

02/07/2020  04:38 PM    <DIR>          .
02/07/2020  04:38 PM    <DIR>          ..
05/20/2019  10:19 PM             6,148 .DS_Store
02/07/2020  01:50 PM    <DIR>          .ipynb_checkpoints
02/05/2020  02:26 PM            20,544 boolean indexing and datatypes .ipynb
02/07/2020  10:59 AM               132 data.txt
05/26/2017  02:38 PM                 0 Icon_
02/03/2020  02:31 PM            19,170 Intro to Data.ipynb
02/03/2020  04:45 PM             8,475 Intro to numpy.ipynb
02/07/2020  04:38 PM            39,300 Introduction to Panda.ipynb
05/21/2019  04:12 PM            29,840 Introduction to Pandas.ipynb
02/07/2020  04:11 PM    <DIR>          movielens
02/07/2020  11:23 AM            28,176 Numpy tutorial.ipynb
02/05/2020  04:25 PM         5,927,084 satellite image analysis using numpy.ipynb
02/03/2020  02:44 PM             1,790 Using unix commands in J

In [85]:
ls /mowielens


Invalid switch - "mowielens".


In [86]:
ls /movielens

Invalid switch - "movielens".


In [87]:
ls ./movielens

Invalid switch - "movielens".


In [88]:
!ls /movielens


'ls' is not recognized as an internal or external command,
operable program or batch file.



In [92]:
ls


 Volume in drive C has no label.
 Volume Serial Number is 8023-0B54

 Directory of C:\Users\User\Desktop\Pythonude\EDX_data_course

02/07/2020  04:52 PM    <DIR>          .
02/07/2020  04:52 PM    <DIR>          ..
05/20/2019  10:19 PM             6,148 .DS_Store
02/07/2020  01:50 PM    <DIR>          .ipynb_checkpoints
02/05/2020  02:26 PM            20,544 boolean indexing and datatypes .ipynb
02/07/2020  10:59 AM               132 data.txt
05/26/2017  02:38 PM                 0 Icon_
02/03/2020  02:31 PM            19,170 Intro to Data.ipynb
02/03/2020  04:45 PM             8,475 Intro to numpy.ipynb
02/07/2020  04:52 PM            42,943 Introduction to Panda.ipynb
05/21/2019  04:12 PM            29,840 Introduction to Pandas.ipynb
02/07/2020  11:23 AM            28,176 Numpy tutorial.ipynb
02/05/2020  04:25 PM         5,927,084 satellite image analysis using numpy.ipynb
02/03/2020  02:44 PM             1,790 Using unix commands in Juypter.ipynb
              11 File(s)      6,084,