# pandas

## Intro
Pandas is a data manipulation and analysis library for Python. It is built on top of NumPy, and its key data structure is the DataFrame.

It is written in Python, C and Cython.

**Stable release**: 0.23.4 / 3 August 2018

In [2]:
import pandas as pd
import numpy as np

In [3]:
# explain about dataframes, R influence etc

Loading data from a CSV file 

In [20]:
df = pd.read_csv("data/train.csv")

Find the number of rows in a dataframe

In [13]:
print(df.shape)
# df[1].count()
len(df.index)

(5000, 12)


5000

In [None]:
# figure out how to add a title to the image, worst case write a caption here and center the text

<center>[Source](https://stackoverflow.com/questions/15943769/how-do-i-get-the-row-count-of-a-pandas-dataframe)</center>

![Comparison of len, count and shape](img/count-len-shape-comparison.png)


In [None]:
# head, tail, parameters

In [4]:
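df.head()  # first 5 rows by default; pass n to change this, e.g. df.head(10)
df.tail()  # last 5 rows; the notebook displays only this last expression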

Unnamed: 0,datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
10881,2012-12-19 19:00:00,4,0,1,1,15.58,19.695,50,26.0027,7,329,336
10882,2012-12-19 20:00:00,4,0,1,1,14.76,17.425,57,15.0013,10,231,241
10883,2012-12-19 21:00:00,4,0,1,1,13.94,15.91,61,15.0013,4,164,168
10884,2012-12-19 22:00:00,4,0,1,1,13.94,17.425,61,6.0032,12,117,129
10885,2012-12-19 23:00:00,4,0,1,1,13.12,16.665,66,8.9981,4,84,88


In [None]:
# usecols, nrows, skiprows, 

In [7]:
df = pd.read_csv("data/train.csv", nrows = 5000)
df.shape

(5000, 12)

In [13]:
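# a tuple is read as individual line numbers: only lines 0 (the header) and 2000 are skipped,
# so the first data row is promoted to the header below; pass header = None to avoid that
df = pd.read_csv("data/train.csv", nrows = 5000, skiprows = (0,2000))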
df.head()

Unnamed: 0,2011-01-01 00:00:00,1,0,0.1,1.1,9.84,14.395,81,0.2,3,13,16
0,2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
1,2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
2,2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
3,2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1
4,2011-01-01 05:00:00,1,0,0,2,9.84,12.88,75,6.0032,0,1,1
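
The output above shows the first data row being used as column names. A minimal sketch of why that happens, and of skipping a block of rows while keeping the header, follows. The small in-memory CSV and the names `csv_text`, `skipped_two` and `skipped_range` are assumptions made for illustration, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical 6-row CSV standing in for data/train.csv
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(6))

# A tuple or list is read as individual line numbers: only lines 0 and 3 are skipped.
# Line 0 is the header, so the first data row is promoted to column names.
skipped_two = pd.read_csv(io.StringIO(csv_text), skiprows=(0, 3))

# A range starting at 1 keeps the header and skips the first three data rows.
skipped_range = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 4))

print(list(skipped_two.columns))
print(skipped_range["id"].tolist())  # [3, 4, 5]
```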


In [19]:
df = pd.read_csv("data/train.csv", nrows = 5000, usecols = ['season','workingday','count'])
df.head()

Unnamed: 0,season,workingday,count
0,1,0,16
1,1,0,40
2,1,0,32
3,1,0,13
4,1,0,1


### pandas.to_datetime
When a CSV file is loaded into a DataFrame, date-time values in the file are read as strings rather than as datetime objects, which makes operations such as computing a time difference awkward. The `pandas.to_datetime()` function converts such strings into pandas datetime values.
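
A minimal sketch of the conversion, using a small hypothetical frame (`df_dates`) in place of the bike-sharing data so that it runs on its own:

```python
import pandas as pd

# Hypothetical frame with timestamps stored as plain strings,
# which is how read_csv returns them by default
df_dates = pd.DataFrame(
    {"datetime": ["2011-01-01 00:00:00", "2011-01-01 05:00:00"]}
)

# Convert the string column to datetime values
df_dates["datetime"] = pd.to_datetime(df_dates["datetime"])
print(df_dates["datetime"].dtype)

# Arithmetic now works: subtracting two timestamps gives a Timedelta
elapsed = df_dates["datetime"].iloc[1] - df_dates["datetime"].iloc[0]
print(elapsed)
```

The conversion can also be requested at load time by passing `parse_dates=["datetime"]` to `pd.read_csv`.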



## Identifying and dropping missing values

In [22]:
df.isnull().sum()

datetime      0
season        0
holiday       0
workingday    0
weather       0
temp          0
atemp         0
humidity      0
windspeed     0
casual        0
registered    0
count         0
dtype: int64

In [None]:
# introduce missing values and then drop those rows
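
A minimal sketch of introducing missing values and then dropping the affected rows. It uses a small hypothetical frame (`df_demo`) rather than the bike-sharing data, so the column values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing temperatures
df_demo = pd.DataFrame(
    {"season": [1, 2, 3, 4], "temp": [9.84, np.nan, 9.02, np.nan]}
)

# Count missing values per column
missing_per_column = df_demo.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
df_clean = df_demo.dropna()
print(df_clean.shape)  # (2, 2)
```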

Find the unique values in a column and count them (especially useful for finding the number of classes in a categorical variable)

In [24]:
df['season'].unique() # can be applied only to a column, not the entire dataframe

array([1, 2, 3, 4], dtype=int64)

In [27]:
df['season'].nunique()

4

## Sorting and Grouping



## Merge

![Different Parameters of Merge](img/pandas-merge-join-different-variable-names.png)

[Source](https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/)

**Types of merges**

- Inner merge / inner join: the default pandas behaviour. Only rows whose `on` value exists in both the left and right dataframes are kept.

- Left merge / left outer join: keeps every row of the left dataframe. Where an `on` value has no match in the right dataframe, the columns coming from the right are filled with NaN.

- Right merge / right outer join: keeps every row of the right dataframe. Where an `on` value has no match in the left dataframe, the columns coming from the left are filled with NaN.

- Outer merge / full outer join: returns all rows from the left dataframe and all rows from the right dataframe, matching rows where possible and filling with NaN elsewhere.
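
The four merge types can be compared on two small frames. This is a sketch under assumed data: `left`, `right` and their columns are hypothetical and chosen so that only keys 2 and 3 appear on both sides.

```python
import pandas as pd

# Hypothetical frames: keys 2 and 3 are shared, 1 is left-only, 4 is right-only
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Run each merge type and record which keys survive
results = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="key", how=how)
    results[how] = merged
    print(how, sorted(merged["key"].tolist()))
```

When the key columns are named differently in the two frames, `left_on` and `right_on` can be used in place of `on`.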



## Memory Optimization with pandas

Why optimize memory? 

Memory usage can become a major problem with large datasets, such as those of the PUBG challenge and the Microsoft challenge. Suppose the dataset is 4 GB: loading it into RAM and running operations on it becomes very costly in time. This is why we try to reduce the memory it consumes.


Source - This Kaggle [Kernel](https://www.kaggle.com/yuliagm/how-to-work-with-big-datasets-on-16g-ram-dask)

- TIP 1 - Deleting unused variables and gc.collect()
- TIP 2 - Presetting the datatypes
- TIP 3 - Importing selected rows of a file (including generating your own subsamples)
- TIP 4 - Importing in batches and processing each individually
- TIP 5 - Importing just selected columns
- TIP 6 - Using Dask
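
Several of these tips map directly onto `pd.read_csv` parameters. The sketch below illustrates tips 1, 2, 4 and 5 on a tiny in-memory CSV; the data and the names `csv_text`, `small` and `total` are assumptions made for illustration, and the dtypes are chosen by hand rather than derived from a real dataset.

```python
import gc
import io

import numpy as np
import pandas as pd

# Hypothetical CSV standing in for a large file on disk
csv_text = "season,workingday,temp,count\n" + "\n".join(
    f"{i % 4 + 1},{i % 2},{i * 0.5},{i}" for i in range(10)
)

# TIP 2 and TIP 5: preset compact dtypes and read only the columns needed
small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["season", "count"],
    dtype={"season": np.int8, "count": np.int16},
)
small_dtypes = small.dtypes.astype(str).to_dict()
print(small_dtypes)

# TIP 4: read the file in batches of 4 rows and aggregate each batch separately
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["count"].sum())
print(total)  # 45

# TIP 1: drop the reference once the frame is no longer needed, then collect
del small
gc.collect()
```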



Load the PUBG dataset and show its memory usage and the dtype of each column (`memory_usage` and `info`). Show the RAM in use, then delete the dataframe, call `gc.collect()`, and show the RAM again.

In [8]:
import pandas as pd
import numpy as np
# df = pd.read_csv("data/PUBG_train.csv")
df = reduce_mem_usage(df)  # defined in a later cell; run that cell first
df.memory_usage().sum()/1024**2  # total size in MB
# del df

# import gc
# gc.collect()

288.38516998291016

In [11]:
df.dtypes

Id                  object
groupId             object
matchId             object
assists               int8
boosts                int8
damageDealt        float16
DBNOs                 int8
headshotKills         int8
heals                 int8
killPlace             int8
killPoints           int16
kills                 int8
killStreaks           int8
longestKill        float16
matchDuration        int16
matchType           object
maxPlace              int8
numGroups             int8
rankPoints           int16
revives               int8
rideDistance       float16
roadKills             int8
swimDistance       float16
teamKills             int8
vehicleDestroys       int8
walkDistance       float16
weaponsAcquired      int16
winPoints            int16
winPlacePerc       float16
dtype: object

In [None]:
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    #start_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
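            else:
                # every other non-object dtype is treated as a float column;
                # note that float16 keeps only about 3 significant decimal digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)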
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    #end_mem = df.memory_usage().sum() / 1024**2
    #print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    #print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
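
The bounds the function compares against come from `np.iinfo` and `np.finfo`. The sketch below prints them and also shows a caveat of downcasting to `float16`: a value can lie within the representable range and still lose precision. The variable names are chosen for illustration only.

```python
import numpy as np

# Representable range of each signed integer type used by the function
int_ranges = {}
for int_type in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    int_ranges[int_type.__name__] = (info.min, info.max)
    print(int_type.__name__, info.min, info.max)

# Largest finite float16 value
float16_max = float(np.finfo(np.float16).max)
print(float16_max)  # 65504.0

# 2049 is well inside the float16 range but cannot be stored exactly
rounded = float(np.float16(2049.0))
print(rounded)  # 2048.0
```

Because of this rounding, the range check alone does not guarantee that a float column survives the downcast unchanged.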

In [None]:
# explain the above code

For every non-object (non-string) column in the dataframe, the code above first determines whether the column holds integer or floating-point values, based on the first three characters of that column's `dtype`.

Then it finds the minimum and maximum values of that column and compares these values to find which 