# Simple Intro to Pandas

This tutorial looks at the popular Python library Pandas, which is widely used to explore, clean, transform and wrangle large datasets.

**Tutorial Structure**
- [Preamble](#Preamble)
- [Import Data](#Import-Data)
 - [Creating DataFrames](#Creating-DataFrames)
 - [Read Files](#Read-Files)

# Preamble

In [1]:
%load_ext autoreload
%autoreload 2
# install the im_tutorials package
!pip install git+https://github.com/nestauk/im_tutorials.git

Collecting git+https://github.com/nestauk/im_tutorials.git
  Cloning https://github.com/nestauk/im_tutorials.git to /tmp/pip-req-build-ga47_v9v
  Running command git clone -q https://github.com/nestauk/im_tutorials.git /tmp/pip-req-build-ga47_v9v
Building wheels for collected packages: im-tutorials
  Building wheel for im-tutorials (setup.py) ... done
  Created wheel for im-tutorials: filename=im_tutorials-0.1.0-cp36-none-any.whl size=12596 sha256=a05a978aace5b1cd4cbe16a82484b908eb88e658deb253a4e692cd420f7e0713
  Stored in directory: /tmp/pip-ephem-wheel-cache-nt6qbak1/wheels/47/a3/cb/bdc5f9ba49bcfd2c6864b166a1566eb2f104113bf0c3500330
Successfully built im-tutorials


In [2]:
# numpy for mathematical functions
import numpy as np
# pandas for handling tabular data
import pandas as pd
# explained later
from im_tutorials.data import cordis
import matplotlib.pyplot as plt

# Import Data

## Creating DataFrames

Sometimes it is handy to hardcode a small dataframe, for example when hacking together a quick test. This is one way to create a dataframe from scratch.

In [9]:
# useful for hacking
df_1 = pd.DataFrame(
    {'col1' : ['a', 'b', None,'c'],
    'col2' : ['d', 'e', 'f','g'],
    'col3' : [1, 2, 3, None],
    'col4' : [4, 5, 6, 7]}
)

In [None]:
df_1

## Read Files

In [None]:
# if working from a local .csv file
df = pd.read_csv('file.csv')

In [3]:
# alternatively, load the CORDIS H2020 projects dataset
cordis_projects_df = cordis.h2020_projects()



# A Look at the Data
<br/>
It is good practice to look at what is inside your dataset before you start answering questions. Pandas makes it easy to explore the data and carry out basic analysis using the library's methods and functions.

Sometimes, we want a peek at what is going on inside. The functions `.head()` and `.tail()` display the first n rows and the last n rows, respectively. By default `n = 5`; pass a different value of `n` to change the number of rows shown.

In [None]:
cordis_projects_df.head(n=3)

In [None]:
cordis_projects_df.tail(n=3)

As seen above, a dataframe is a tabular data structure consisting of rows and columns. Rows are identified by the index, whilst columns are identified by their column names. Both can be extended with new entries.

In [None]:
cordis_projects_df.index

In [None]:
cordis_projects_df.columns

There are cases where you may want to apply a calculation across rows or columns. This is controlled by the `axis` parameter found in many methods (most use `axis=0` by default). A few examples are shown throughout this tutorial.
- Axis 0: apply the operation down the rows, giving one result per column
- Axis 1: apply the operation across the columns, giving one result per row

<img src="../reports/figures/axis.png">
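As a rough illustration of the two axes, here is a minimal sketch on a small toy dataframe. The name `toy` and its values are made up for this example and are not part of the CORDIS data.

```python
import pandas as pd

# hypothetical toy data, not part of the CORDIS dataset
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# axis=0: collapse the rows, giving one result per column
col_totals = toy.sum(axis=0)

# axis=1: collapse the columns, giving one result per row
row_totals = toy.sum(axis=1)

print(col_totals)
print(row_totals)
```

The first result is indexed by column name, the second by row label.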


In [None]:
cordis_projects_df.dtypes
# the `object` dtype is pandas' way of saying there's non-numerical data in the column

In [None]:
# can look at columns separately
cordis_projects_df['subjects']

In [None]:
# select several columns by passing a list of column names
cordis_projects_df[['status', 'subjects']]

In [None]:
cordis_projects_df['topics'].value_counts()

In [None]:
cordis_projects_df.shape

## Maths & Summaries
<br/>
Many DataFrame methods compute summaries such as sums, counts and descriptive statistics.

In [None]:
# numeric_only=True restricts the sum to numerical columns
cordis_projects_df.sum(numeric_only=True)

In [None]:
# we can also use axis here (numeric_only=True skips non-numerical columns)
cordis_projects_df.sum(axis=1, numeric_only=True)

In [None]:
# count of number of elements present in column
cordis_projects_df.count()

In [None]:
# or if you want the result for one column (this works for any method); shown on the toy dataframe
df_1['col3'].count()

In [None]:
# only on numerical columns
cordis_projects_df.describe()

**CAUTION!** <br/>
Use these functions with caution. Some fields contain data such as IDs or serial numbers, which are numerical and are therefore included in the analysis even though their statistics are meaningless. It is up to the user to decide what makes sense.
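One way to act on this caution is sketched below, assuming a toy dataframe in which `project_id` is a numerical label rather than a measurement. The column names and values are invented for illustration.

```python
import pandas as pd

# hypothetical toy data: `project_id` is numeric but is only a label
toy = pd.DataFrame({
    'project_id': [1001, 1002, 1003],
    'cost': [50.0, 75.0, 100.0],
})

# summarise only the columns where statistics are meaningful
summary = toy[['cost']].describe()
print(summary)

# alternatively, store identifiers as strings so that
# describe() skips them by default
toy['project_id'] = toy['project_id'].astype(str)
summary_after_cast = toy.describe()
print(summary_after_cast)
```

Either approach keeps the identifier out of the summary; which one fits best depends on how the column is used elsewhere.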

In [None]:
# or get each result separately (numerical columns only); shown on the toy dataframe
df_1.mean(numeric_only=True)

**_Task_**:
Now try other methods such as `.min()`, `.max()`, `.median()`, `.var()`, `.std()` and `.quantile()`.

In [None]:
# write code here

In [None]:
# columns can be added together element-wise

df_1['col3'] + df_1['col4']

Now try adding two columns that have different datatypes and see what happens.


In [None]:
# write code here

# Filtering & Subsets

In [None]:
# get the subset of rows where the condition is true
cordis_projects_df[cordis_projects_df['coordinatorCountry'] == 'UK'].head()

In [None]:
#instead of this, can use .loc 

cordis_projects_df.loc[cordis_projects_df['coordinatorCountry'] == 'UK']

In [None]:
# .iloc slices by position: the first 4 rows (rows 0, 1, 2, 3)
cordis_projects_df.iloc[:4]

# .loc slices by label: up to and including index label 4 (rows 0, 1, 2, 3, 4)
cordis_projects_df.loc[:4]

# subsets of null / not-null values can be selected in the same way



In [None]:
# cordis_projects_df

In [None]:
# loc based on the name of labels in index
# iloc based on position in index
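The difference between label-based and position-based selection is easier to see when the index is not the default 0, 1, 2, ... The sketch below uses a made-up dataframe with string labels (`p1` to `p4` are hypothetical) and also shows null / not-null subsets.

```python
import pandas as pd

# hypothetical toy data with string labels as the index
toy = pd.DataFrame(
    {'country': ['UK', 'FR', None, 'UK'], 'cost': [1.0, 2.0, 3.0, 4.0]},
    index=['p1', 'p2', 'p3', 'p4'],
)

# .loc selects by label and includes the end label
by_label = toy.loc['p1':'p3']

# .iloc selects by position and excludes the end position
by_position = toy.iloc[0:2]

# rows where `country` is missing, and rows where it is present
missing = toy[toy['country'].isnull()]
present = toy[toy['country'].notnull()]

print(by_label)
print(by_position)
print(missing)
```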

# Data Wrangling

## Cleaning

### Dropping Data

In [None]:
# drop columns (a single column name or a list of names); returns a new dataframe
df_1.drop(['col4'], axis=1)

In [None]:
# drop rows by index label (a single label or a list of labels)
df_1.drop([0], axis=0)

## Handling Missing Data

In [None]:
cordis_projects_df.isnull()

In [None]:
cordis_projects_df[cordis_projects_df['participants'].isnull()]

Descriptions

In [None]:
# another way to drop columns: remove every column that contains at least one NaN
# (shown on the toy dataframe)
df_1.dropna(axis=1, how='any')

In [None]:
# the same for rows: remove every row that contains at least one NaN; axis=0 is the default
df_1.dropna(how='any')

There are other `how` parameter options. See what happens when `how` equals `all`.
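For reference, the two `how` options can be compared on a small invented dataframe. This is only a sketch; the column names `x`, `y` and `z` are placeholders.

```python
import numpy as np
import pandas as pd

# hypothetical toy data: row 0 has one NaN, row 1 is entirely NaN
toy = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [np.nan, np.nan, 6.0],
    'z': [7.0, np.nan, 9.0],
})

# how='any': drop a row if at least one value is missing
any_dropped = toy.dropna(how='any')

# how='all': drop a row only if every value is missing
all_dropped = toy.dropna(how='all')

print(any_dropped)
print(all_dropped)
```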

Drop duplicates (rows)

In [None]:
df_1.drop_duplicates()

# duplicates can also be judged on a single column
df_1.drop_duplicates(subset='col2')

In [10]:
df_1.replace(np.nan, 0.0)  # returns a new dataframe; pass inplace=True to modify df_1 itself

|   | col1 | col2 | col3 | col4 |
|---|------|------|------|------|
| 0 | a    | d    | 1.0  | 4    |
| 1 | b    | e    | 2.0  | 5    |
| 2 | 0    | f    | 3.0  | 6    |
| 3 | c    | g    | 0.0  | 7    |


In [11]:
df_1

|   | col1 | col2 | col3 | col4 |
|---|------|------|------|------|
| 0 | a    | d    | 1.0  | 4    |
| 1 | b    | e    | 2.0  | 5    |
| 2 | None | f    | 3.0  | 6    |
| 3 | c    | g    | NaN  | 7    |


## Delete/Add Rows

## Rename & Resets

In [None]:
# renaming columns
# reset index
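A minimal sketch of renaming columns and resetting the index, using an invented dataframe (the new column names `letter` and `number` are placeholders chosen for illustration):

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [1, 2, 3]})

# rename columns with a mapping of old name -> new name
renamed = toy.rename(columns={'col1': 'letter', 'col2': 'number'})

# filtering leaves gaps in the index (here the labels 1 and 2 remain)
filtered = renamed[renamed['number'] > 1]

# reset_index renumbers the rows from 0; drop=True discards the old index
reset = filtered.reset_index(drop=True)

print(renamed.columns.tolist())
print(reset)
```

Without `drop=True`, the old index is kept as a new column called `index`.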

## Transformation

In [None]:
# add new columns
cordis_projects_df['half_totalCost'] = cordis_projects_df['totalCost'] * 0.5
# group data, apply a function and merge
# refer back to data types and how to change the datatype of a column

In [None]:
# transpose data (shown on the toy dataframe)

df_1.T


### Mapping

In [None]:
# dictionary

In [None]:
# 
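One common way to map values with a dictionary is `Series.map`. The sketch below assumes a made-up lookup table, `country_names`, and a toy `country` column; neither comes from the CORDIS data.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({'country': ['UK', 'FR', 'DE', 'UK']})

# hypothetical lookup table: code -> full name
country_names = {'UK': 'United Kingdom', 'FR': 'France'}

# each value is looked up in the dictionary;
# values without an entry (here 'DE') become NaN
toy['country_name'] = toy['country'].map(country_names)

print(toy)
```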

### GroupBy

In [None]:
# preferably group by categorical-type columns
grouped_df = cordis_projects_df.groupby(by=['status', 'ecMaxContribution'])

Here, we can investigate statistical results of each numerical column based on the groups defined by applying the methods from earlier.

In [None]:
grouped_df.mean(numeric_only=True)

In [None]:
grouped_df.sum(numeric_only=True)

In [None]:
# a custom function can be applied to each group
# avoid lambda where possible; use a numpy function instead
grouped_df['totalCost'].apply(lambda x: x/x.count())  # divide each value by the number of non-null values in its group

In [None]:
cordis_projects_df.head()

In [None]:
# can group by on different levels and calculate on different levels

In [None]:
# careful: some columns may hold integers that are really IDs,
# so it is up to you to decide what to do with them
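The groupby ideas above can be tried out on a small invented dataframe. In this sketch the `status` values and costs are made up; it shows a single statistic, several statistics at once via `.agg`, and a group-wise calculation via `.transform` that avoids a lambda.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    'status': ['SIGNED', 'SIGNED', 'CLOSED', 'CLOSED'],
    'cost': [10.0, 30.0, 5.0, 15.0],
})

# one statistic per group
mean_cost = toy.groupby('status')['cost'].mean()

# several statistics per group in one call
summary = toy.groupby('status')['cost'].agg(['mean', 'sum', 'count'])

# transform returns a value for every original row, so each cost
# can be divided by the total of its own group
share = toy['cost'] / toy.groupby('status')['cost'].transform('sum')

print(mean_cost)
print(summary)
print(share)
```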

# Combining Data

Firstly, let's define another toy example.

In [None]:
df_2 = pd.DataFrame(
    {
        'col1': ['a', 'b', 1, 2], 
        'col2': ['d', 'e', 'f', 'k'], 
        'col3': [5, 6, 7, 8],
        'col4': ['h', 'i', 'j', 'k']
    }
)

In [None]:
df_1

In [None]:
df_2

## Merge

In Pandas, there are various ways to merge: `left`, `right`, `inner`, `outer`. Here, we have to specify which one to use with the `how` parameter, together with the column to merge on.

In [None]:
# how='left' keeps every row of the first (left) dataframe and attaches matching rows from the second
# note: the column named in `on` must exist in both dataframes
# rows of the left dataframe with no match in the right one get NaN in the right-hand columns
pd.merge(
    df_1,
    df_2,
    how='left',
    on='col2'
)
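To see where each row of a merge came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses two invented dataframes (`left`, `right` and their columns are placeholders) rather than `df_1` and `df_2`.

```python
import pandas as pd

# hypothetical toy data sharing a `key` column
left = pd.DataFrame({'key': ['d', 'e', 'f'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['d', 'e', 'k'], 'right_val': [10, 20, 30]})

# a left merge keeps all rows of `left`;
# indicator=True records whether a match was found in `right`
merged = pd.merge(left, right, how='left', on='key', indicator=True)

print(merged)
```

The row with key `f` has no partner in `right`, so its `right_val` is NaN and `_merge` reads `left_only`.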



**_Task_**:
Now see what happens when the `how` parameter equals `right`, `inner` and `outer`.

In [None]:
# write code here

## Concatenate

In [None]:
# concatenate: a way of stacking dataframes
# used to append two or more dataframes on top of each other or side by side
pd.concat([df_1, df_2])  # default axis=0

In [None]:
pd.concat([df_1, df_2], axis=1)  # axis=1 places the dataframes side by side, aligned on the index
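One detail worth knowing when stacking is that the original index labels are kept, so they can repeat. A small sketch with invented dataframes `top` and `bottom`:

```python
import pandas as pd

# hypothetical toy data with identical columns
top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

# stacking keeps the original index labels, so 0 and 1 appear twice
stacked = pd.concat([top, bottom])

# ignore_index=True renumbers the rows from 0
stacked_clean = pd.concat([top, bottom], ignore_index=True)

# axis=1 places the dataframes side by side, matching rows on the index
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked)
print(stacked_clean)
print(side_by_side)
```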

# Working with...

Here, we will look at different non-numeric data types and how we can work with them in pandas.

## DateTime
<br />
In a lot of cases, datetime information is stored as strings. Thankfully, pandas can deal with this: `pd.to_datetime` transforms these strings into datetime objects.

In [None]:
# in this dataset the dates are already datetime objects,
# so this conversion leaves them unchanged
cordis_projects_df['startDate'] = pd.to_datetime(cordis_projects_df['startDate'])
cordis_projects_df

In [None]:
# for example, the first row 
print(cordis_projects_df['startDate'][0].year)
print(cordis_projects_df['startDate'][0].month)
print(cordis_projects_df['startDate'][0].day)

In [None]:
# the .dt accessor applies a datetime property to every element, so no lambda is needed
# the result can be added as a new column
cordis_projects_df['month'] = cordis_projects_df['startDate'].dt.month
cordis_projects_df['month'].head()
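Since the CORDIS dates are already datetime objects, here is a sketch of the string-to-datetime case on invented data. The column name `start` and its values are hypothetical; `errors='coerce'` is one option for handling strings that cannot be parsed.

```python
import pandas as pd

# hypothetical toy data: dates stored as strings, one of them invalid
toy = pd.DataFrame({'start': ['2019-01-15', '2019-06-30', 'not a date']})

# convert to datetime; errors='coerce' turns unparseable strings into NaT
toy['start'] = pd.to_datetime(toy['start'], errors='coerce')

# the .dt accessor exposes datetime properties for the whole column
toy['year'] = toy['start'].dt.year
toy['month'] = toy['start'].dt.month

print(toy)
```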

## Strings

In [None]:
# .astype(str) converts a column's elements to strings; the .str accessor then exposes string methods

# .str.len() returns the length of each string in the column

# .str.replace() substitutes one substring for another
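The string methods listed above might be used as follows. This is a sketch on an invented `title` column, not on the CORDIS data.

```python
import pandas as pd

# hypothetical toy data with one missing title
toy = pd.DataFrame({'title': ['Smart Cities', 'clean energy', None]})

# length of each string; missing values stay missing
lengths = toy['title'].str.len()

# lower-case every string
lowered = toy['title'].str.lower()

# replace a literal substring (regex=False disables pattern matching)
replaced = toy['title'].str.replace('clean', 'green', regex=False)

# convert a numerical column to strings
as_text = pd.Series([1, 2, 3]).astype(str)

print(lengths)
print(lowered)
print(replaced)
```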

# Plotting

## Pandas Plotting
https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

In [None]:
# histogram plot: all pandas plots are matplotlib figures, created through pandas-specific functions
cordis_projects_df['month'].hist()
# plt.show()

In [None]:
cordis_projects_df.columns

In [None]:
cordis_projects_df['totalCost'].plot(kind='bar')

In [None]:
# grouped_df.size().unstack().plot(kind= 'bar', stacked=True)

#show few examples directly from df 

## Plotting using MatplotLib

https://matplotlib.org/3.1.1/contents.html

In [None]:
#to show they can be thrown into 
# one example with matplotlib
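A minimal matplotlib sketch, assuming an invented dataframe with `month` and `cost` columns. It plots directly with matplotlib and then shows that a pandas plot can be drawn onto an existing matplotlib Axes through the `ax` parameter.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({'month': [1, 2, 3, 4], 'cost': [10.0, 12.5, 9.0, 14.0]})

# plain matplotlib: pass the columns as x and y values
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(toy['month'], toy['cost'], marker='o')
ax.set_xlabel('month')
ax.set_ylabel('cost')
ax.set_title('Cost per month (toy data)')

# pandas plots accept an existing Axes through the `ax` parameter
fig2, ax2 = plt.subplots()
toy.plot(x='month', y='cost', kind='bar', ax=ax2)
```

Working with the Axes object gives full control over labels, titles and layout, which the pandas shortcuts only partly expose.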

# Available Datasets

How to read these