# pandas

### Intro
pandas is a data manipulation and analysis library for Python. It is built on the NumPy package, and its key data structure is the DataFrame.

It is written in Python, C and Cython.

**Stable release**: 0.23.4 / 3 August 2018

In [2]:
import pandas as pd
import numpy as np

In [3]:
# explain about dataframes, R influence etc
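
As a minimal sketch of what a DataFrame is: a labeled, two-dimensional table whose columns can hold different types, similar in spirit to R's `data.frame`. The toy values below are hypothetical; only the column names mirror the dataset loaded later.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the dataset used in this notebook
toy = pd.DataFrame(
    {
        "season": [1, 1, 2],
        "temp": [9.84, 9.02, 14.76],
        "count": [16, 40, 241],
    }
)

# Each column is a Series with its own dtype
print(toy.dtypes)

# A single column name returns a Series; a list of names returns a DataFrame
season = toy["season"]
subset = toy[["season", "count"]]
```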

Loading data from a CSV file 

In [20]:
df = pd.read_csv("data/train.csv")

Find the number of rows in a dataframe

In [13]:
print(df.shape)
# df['season'].count() would count the non-null values in one column
len(df.index)

(5000, 12)


5000
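
The three approaches do not always agree. A small sketch on a hypothetical toy frame with one missing value shows where `count` differs from `shape` and `len`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in column "a"
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": ["w", "x", "y", "z"]})

n_rows_shape = toy.shape[0]    # shape is a (rows, columns) tuple
n_rows_len = len(toy.index)    # length of the row index
n_non_null = toy["a"].count()  # counts non-null values only, so the NaN row is excluded

print(n_rows_shape, n_rows_len, n_non_null)
```

`count` is therefore only a row count when the chosen column has no missing values.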

In [None]:
# figure out how to add a title to the image, worst case write a caption here and center the text

![Comparison of len, count and shape](img/count-len-shape-comparison.png)

In [None]:
# head, tail, parameters

In [4]:
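df.head() # first 5 rows by default; not displayed because a cell shows only its last expression
df.tail() # last 5 rows by default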

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.91,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129
10885,2012-12-19 23:00:00,4,0,1,1,13.12,16.665,66,8.9981,4,84,88
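
Both methods take an optional `n` parameter. A short sketch on a hypothetical ten-row frame:

```python
import pandas as pd

# Hypothetical frame with rows labeled 0..9
toy = pd.DataFrame({"x": range(10)})

first_three = toy.head(3)        # rows 0, 1, 2
last_two = toy.tail(2)           # rows 8, 9
all_but_last_two = toy.head(-2)  # negative n: every row except the last 2
```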


In [None]:
# usecols, nrows, skiprows, 

In [7]:
df = pd.read_csv("data/train.csv", nrows = 5000)
df.shape

(5000, 12)

In [18]:
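# skiprows = (0,2000) skips only lines 0 (the header) and 2000, not a range of lines,
# so the first data row is used as column names below; pass header = None to avoid that
df = pd.read_csv("data/train.csv", nrows = 5000, skiprows = (0,2000))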
df.head()

Unnamed: 0,2011-01-01 00:00:00,1,0,0.1,1.1,9.84,14.395,81,0.2,3,13,16
0,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
1,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
2,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
3,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1
4,2011-01-01 05:00:00,1,0,0,2,9.84,12.88,75,6.0032,0,1,1


In [19]:
df = pd.read_csv("data/train.csv", nrows = 5000, usecols = ['season','workingday','count'])
df.head()

Unnamed: 0,season,workingday,count
0,1,0,16
1,1,0,40
2,1,0,32
3,1,0,13
4,1,0,1
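
The behaviour of `skiprows`, `nrows` and `usecols` is easier to check on a small file. The sketch below uses a hypothetical in-memory CSV in place of `data/train.csv`; the values are made up.

```python
import io
import pandas as pd

# Hypothetical CSV: one header line followed by 10 data lines
csv_text = "datetime,season,count\n" + "\n".join(
    f"2011-01-01 {h:02d}:00:00,1,{h * 10}" for h in range(10)
)

# A tuple or list skips exactly those line numbers (line 0 is the header),
# so header=None is needed to stop the first data line becoming column names
only_two_skipped = pd.read_csv(io.StringIO(csv_text), skiprows=(0, 5), header=None)

# A range skips a contiguous block of data lines and keeps the header
block_skipped = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 6), nrows=3)

# usecols parses only the named columns
two_cols = pd.read_csv(io.StringIO(csv_text), usecols=["season", "count"])
```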


### Identifying and dropping missing values

In [22]:
df.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64

In [None]:
# introduce missing values and then drop those rows
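
One way to do this, sketched on a hypothetical toy frame rather than the real dataset: set a few cells to `NaN`, count them with `isnull().sum()`, then remove the affected rows with `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"temp": [9.84, 9.02, 13.94, 13.12], "humidity": [81.0, 80.0, 61.0, 66.0]}
)

# Introduce missing values at arbitrary positions
toy.loc[1, "temp"] = np.nan
toy.loc[3, "humidity"] = np.nan

missing_per_column = toy.isnull().sum()

# Drop every row that contains at least one missing value
complete_rows = toy.dropna()

# Drop rows only when a specific column is missing
temp_present = toy.dropna(subset=["temp"])
```

`dropna` returns a new DataFrame and leaves `toy` unchanged.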

Find the unique values in a column with `unique()` and count them with `nunique()` (especially useful for finding the number of classes in a categorical variable)

In [24]:
df['season'].unique() # can be applied only to a column, not the entire dataframe

array([1, 2, 3, 4], dtype=int64)

In [27]:
df['season'].nunique()

4
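
The two methods treat missing values differently, which matters once a column contains `NaN`. A sketch on a hypothetical Series, with `value_counts` added to show the rows per class:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
season = pd.Series([1, 1, 2, 3, 3, 3, np.nan], name="season")

distinct_values = season.unique()                   # includes NaN
n_distinct = season.nunique()                       # excludes NaN by default
n_distinct_with_nan = season.nunique(dropna=False)  # counts NaN as a value
rows_per_class = season.value_counts()              # rows per value, most frequent first
```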

## Memory Optimization with pandas

Why optimize memory? 

Memory usage becomes a major problem with large datasets. Suppose the dataset is 4 GB: loading all of it into RAM and then operating on it is slow, and it may not fit at all on a machine with limited memory. The tips below aim to reduce the memory consumed.

Source: this Kaggle [kernel](https://www.kaggle.com/yuliagm/how-to-work-with-big-datasets-on-16g-ram-dask)

- TIP 1 - Deleting unused variables and gc.collect()
- TIP 2 - Presetting the datatypes
- TIP 3 - Importing selected rows of a file (including generating your own subsamples)
- TIP 4 - Importing in batches and processing each individually
- TIP 5 - Importing just selected columns
- TIP 6 - Creative data processing
- TIP 7 - Using Dask
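
Tips 1 to 5 can be sketched with plain pandas. This is an illustration under assumptions, not the kernel's own code: the in-memory CSV, its values and the chosen dtypes are hypothetical stand-ins for a large file on disk.

```python
import gc
import io
import pandas as pd

# Hypothetical CSV: one header line followed by 1000 data lines
csv_text = "season,workingday,temp,count\n" + "\n".join(
    f"{i % 4 + 1},{i % 2},{10 + i % 5}.5,{i}" for i in range(1000)
)

# TIP 2 and TIP 5: preset compact dtypes and load only the needed columns
dtypes = {"season": "int8", "workingday": "int8", "count": "int32"}
default = pd.read_csv(io.StringIO(csv_text), usecols=list(dtypes))
small = pd.read_csv(io.StringIO(csv_text), usecols=list(dtypes), dtype=dtypes)
default_bytes = default.memory_usage(deep=True).sum()
small_bytes = small.memory_usage(deep=True).sum()
print(default_bytes, "->", small_bytes)

# TIP 3: a callable skiprows keeps the header and every 10th data line
every_tenth = pd.read_csv(
    io.StringIO(csv_text), skiprows=lambda i: i > 0 and i % 10 != 0
)

# TIP 4: read in batches and keep only a running aggregate per batch
total_by_season = pd.Series(dtype="int64")
for chunk in pd.read_csv(
    io.StringIO(csv_text), chunksize=250, usecols=["season", "count"]
):
    batch_total = chunk.groupby("season")["count"].sum()
    total_by_season = total_by_season.add(batch_total, fill_value=0)

# TIP 1: drop references that are no longer needed, then run the garbage collector
del default
gc.collect()
```

The integer dtypes must be wide enough for the data: `int8` holds values from -128 to 127, so it suits `season` here but would overflow for `count`.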