# <span style="color:#2462C0">Pandas</span>
***pandas*** is a Python library for data analysis. It offers a number of data exploration, cleaning and transformation operations that are critical in working with data in Python. 

*pandas* builds upon *numpy* and *scipy*, providing easy-to-use data structures and data manipulation functions with integrated indexing.

The main data structures *pandas* provides are *Series* and *DataFrames*. After a brief introduction to these two data structures and to data ingestion, this notebook covers the following key features of *pandas*:
* Generating descriptive statistics on data
* Data cleaning using built-in pandas functions
* Frequent data operations for subsetting, filtering, insertion, deletion and aggregation of data
* Merging multiple datasets using DataFrames
* Working with timestamps and time-series data

**Additional Recommended Resources:**
* *pandas* Documentation: http://pandas.pydata.org/pandas-docs/stable/
* *Python for Data Analysis* by Wes McKinney
* *Python Data Science Handbook* by Jake VanderPlas

### <span style="color:#2462C0">Import libraries</span>

In [2]:
import pandas as pd

## <span style="color:#2462C0">Introduction to pandas data structures</span>
pandas has two main data structures: ***Series*** and ***DataFrames***.
* DataFrames store heterogeneous data types

## <span style="color:#2462C0">pandas Series</span>
One-dimensional labeled array

In [3]:
ser = pd.Series(data = [100,True,300,"foo",500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

In [4]:
ser

tom       100
bob      True
nancy     300
dan       foo
eric      500
dtype: object

In [5]:
ser.index

Index(['tom', 'bob', 'nancy', 'dan', 'eric'], dtype='object')

In [6]:
ser[['nancy','bob']]

nancy     300
bob      True
dtype: object

In [7]:
ser.iloc[[4,3,1]]  # select by integer position; plain ser[[4,3,1]] is deprecated on a label-indexed Series

eric     500
dan      foo
bob     True
dtype: object
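
The selections above can also be written with explicit accessors: `.loc` for index labels and `.iloc` for integer positions. A minimal sketch that rebuilds the same `ser` and stores the results in two illustrative variables, `by_label` and `by_position` (names chosen for this example only):

```python
import pandas as pd

ser = pd.Series(data=[100, True, 300, "foo", 500],
                index=['tom', 'bob', 'nancy', 'dan', 'eric'])

by_label = ser.loc[['nancy', 'bob']]   # select by index label
by_position = ser.iloc[[4, 3, 1]]      # select by integer position
print(by_label)
print(by_position)
```

Spelling out the accessor avoids any ambiguity about whether an integer key is meant as a label or as a position.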

In [8]:
ser.iloc[2]

300

In [9]:
'bob' in ser

True

In [10]:
ser

tom       100
bob      True
nancy     300
dan       foo
eric      500
dtype: object

In [11]:
ser * 2

tom         200
bob           2
nancy       600
dan      foofoo
eric       1000
dtype: object

In [12]:
ser[['tom','nancy','eric']] ** 2

tom       10000
nancy     90000
eric     250000
dtype: object
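
The arithmetic above acts element-wise on a single Series. When two Series are combined, pandas first aligns them on their index labels. A minimal sketch, using two made-up Series `a` and `b` that are not part of this notebook:

```python
import pandas as pd

# Hypothetical Series for illustration only
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'z'])

# Values are matched by label, not by position; 'x' has no partner in b,
# so its result is NaN
total = a + b
print(total)
```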

## <span style="color:#2462C0">pandas DataFrame</span>
2-dimensional labeled data structure
### <span style="color:#2462C0">Create a DataFrame from dictionary of Python Series</span>

In [13]:
d = {'one' : pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two' : pd.Series([111., 222., 333., 4444.], index=['apple', 'ball', 'cerill', 'dancy'])}

In [14]:
df = pd.DataFrame(d)
df

| | one | two |
|---|---|---|
| apple | 100.0 | 111.0 |
| ball | 200.0 | 222.0 |
| cerill | NaN | 333.0 |
| clock | 300.0 | NaN |
| dancy | NaN | 4444.0 |

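The missing values in this output come from index alignment: the row labels of `df` are the union of the two Series' indexes, and each column is NaN wherever its Series had no entry for a label. A short sketch that rebuilds the same `df` and counts the gaps:

```python
import pandas as pd

d = {'one': pd.Series([100., 200., 300.], index=['apple', 'ball', 'clock']),
     'two': pd.Series([111., 222., 333., 4444.],
                      index=['apple', 'ball', 'cerill', 'dancy'])}
df = pd.DataFrame(d)

# Row labels: the union of both Series' indexes
print(list(df.index))
# Number of missing entries per column introduced by the alignment
print(df.isna().sum())
```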

In [15]:
df.index

Index(['apple', 'ball', 'cerill', 'clock', 'dancy'], dtype='object')

In [16]:
df.columns

Index(['one', 'two'], dtype='object')

In [17]:
pd.DataFrame(d,index=['apple','ball','clock'])

| | one | two |
|---|---|---|
| apple | 100.0 | 111.0 |
| ball | 200.0 | 222.0 |
| clock | 300.0 | NaN |


In [18]:
pd.DataFrame(d,index=['dancy','ball','apple'],columns=['two','five'])

| | two | five |
|---|---|---|
| dancy | 4444.0 | NaN |
| ball | 222.0 | NaN |
| apple | 111.0 | NaN |


### <span style="color:#2462C0">Create DataFrame from list of Python dictionaries</span>

In [19]:
data = [{'alex':1,'joe':2},{'ema':5,'dora':10,'alice':20}]

In [20]:
pd.DataFrame(data)

| | alex | alice | dora | ema | joe |
|---|---|---|---|---|---|
| 0 | 1.0 | NaN | NaN | NaN | 2.0 |
| 1 | NaN | 20.0 | 10.0 | 5.0 | NaN |


In [21]:
pd.DataFrame(data,index=['orange','red'])

| | alex | alice | dora | ema | joe |
|---|---|---|---|---|---|
| orange | 1.0 | NaN | NaN | NaN | 2.0 |
| red | NaN | 20.0 | 10.0 | 5.0 | NaN |


In [22]:
pd.DataFrame(data,columns=['joe','dora','alice'])

| | joe | dora | alice |
|---|---|---|---|
| 0 | 2.0 | NaN | NaN |
| 1 | NaN | 10.0 | 20.0 |


### <span style="color:#2462C0">Basic DataFrame operations</span>

In [23]:
df

| | one | two |
|---|---|---|
| apple | 100.0 | 111.0 |
| ball | 200.0 | 222.0 |
| cerill | NaN | 333.0 |
| clock | 300.0 | NaN |
| dancy | NaN | 4444.0 |


In [24]:
df['one']

apple     100.0
ball      200.0
cerill      NaN
clock     300.0
dancy       NaN
Name: one, dtype: float64

In [25]:
df['three'] = df['one'] * df['two']
df

| | one | two | three |
|---|---|---|---|
| apple | 100.0 | 111.0 | 11100.0 |
| ball | 200.0 | 222.0 | 44400.0 |
| cerill | NaN | 333.0 | NaN |
| clock | 300.0 | NaN | NaN |
| dancy | NaN | 4444.0 | NaN |


In [26]:
df['flag'] = df['one'] > 250
df

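| | one | two | three | flag |
|---|---|---|---|---|
| apple | 100.0 | 111.0 | 11100.0 | False |
| ball | 200.0 | 222.0 | 44400.0 | False |
| cerill | NaN | 333.0 | NaN | False |
| clock | 300.0 | NaN | NaN | True |
| dancy | NaN | 4444.0 | NaN | False |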

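In this output, `flag` is False for `cerill` and `dancy`, whose `one` value is missing: a comparison against NaN evaluates to False. If missing rows should stay distinguishable from rows that genuinely fail the test, one option is to mask the result with `where`. This is a sketch of one possible approach, not something the notebook itself does, and the standalone Series `one` below simply mirrors the column above:

```python
import numpy as np
import pandas as pd

one = pd.Series([100., 200., np.nan, 300., np.nan],
                index=['apple', 'ball', 'cerill', 'clock', 'dancy'])

flag = one > 250                      # NaN > 250 evaluates to False
# Keep the comparison result only where 'one' is present; elsewhere use NaN
flag_or_missing = flag.where(one.notna())
print(flag)
print(flag_or_missing)
```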

In [27]:
three = df.pop('three')
three

apple     11100.0
ball      44400.0
cerill        NaN
clock         NaN
dancy         NaN
Name: three, dtype: float64

In [28]:
df

| | one | two | flag |
|---|---|---|---|
| apple | 100.0 | 111.0 | False |
| ball | 200.0 | 222.0 | False |
| cerill | NaN | 333.0 | False |
| clock | 300.0 | NaN | True |
| dancy | NaN | 4444.0 | False |


In [29]:
del df['two']

In [30]:
df

| | one | flag |
|---|---|---|
| apple | 100.0 | False |
| ball | 200.0 | False |
| cerill | NaN | False |
| clock | 300.0 | True |
| dancy | NaN | False |

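Both `pop` and `del` change `df` in place. When the original should stay untouched, `DataFrame.drop` returns a new object instead. A small sketch on a made-up frame (`frame` and its columns are illustrative, not part of this notebook):

```python
import pandas as pd

# Hypothetical frame for illustration only
frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

without_b = frame.drop(columns=['b'])  # returns a new DataFrame; frame is unchanged
popped = frame.pop('c')                # removes 'c' from frame in place and returns it

print(list(without_b.columns))
print(list(frame.columns))
```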

In [31]:
df.insert(2,'copy_of_one',df['one']) # 2 is the position where the new column will be
df

| | one | flag | copy_of_one |
|---|---|---|---|
| apple | 100.0 | False | 100.0 |
| ball | 200.0 | False | 200.0 |
| cerill | NaN | False | NaN |
| clock | 300.0 | True | 300.0 |
| dancy | NaN | False | NaN |


In [34]:
df['one_upper_half'] = df['one'][:2]
df

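| | one | flag | copy_of_one | one_upper_half |
|---|---|---|---|---|
| apple | 100.0 | False | 100.0 | 100.0 |
| ball | 200.0 | False | 200.0 | 200.0 |
| cerill | NaN | False | NaN | NaN |
| clock | 300.0 | True | 300.0 | NaN |
| dancy | NaN | False | NaN | NaN |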

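The last assignment works even though its right-hand side has only two rows: pandas aligns the assigned Series with the DataFrame's index and fills the unmatched rows with NaN. A sketch on a made-up frame, where `frame` and `partial` are illustrative names:

```python
import pandas as pd

# Hypothetical frame for illustration only
frame = pd.DataFrame({'one': [100., 200., 300.]},
                     index=['apple', 'ball', 'clock'])

# A Series that covers only some labels, listed in a different order
partial = pd.Series([2., 1.], index=['ball', 'apple'])
frame['partial'] = partial   # matched by label; 'clock' gets NaN
print(frame)
```

Because matching is by label, the order of the assigned values does not matter, only their index.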

---

# <span style="color:#2462C0">Case Study: Movie Data Analysis</span>
This notebook uses a dataset from the MovieLens website. We will describe the dataset further as we explore it using *pandas*.

## Download the Dataset

Please note that **you will need to download the dataset**. Although the video for this notebook says that the data is in your folder, the dataset turned out to be too large to host on the edX platform.

Here are the links to the data source and location:
* **Data Source:** MovieLens web site (filename: ml-20m.zip)
* **Location:** https://grouplens.org/datasets/movielens/

Once the download completes, please make sure the data files are in a directory called *movielens* in your *Week-3-pandas* folder. 
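
One way to confirm the files are in place is a small check like the following sketch. The helper `missing_files`, the relative path `./movielens`, and the file names passed to it are all assumptions made for this example; adjust them to match what the archive actually contains on your machine.

```python
from pathlib import Path

def missing_files(directory, expected):
    """Return the names in `expected` that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in expected if not (folder / name).is_file()]

# Assumed location and file names; an empty list means nothing is missing
print(missing_files('./movielens', ['movies.csv', 'ratings.csv', 'tags.csv']))
```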

Let us look at the files in this dataset using the UNIX command ls.