<a href="https://colab.research.google.com/github/kaushil24/DS-Cookbook101/blob/master/Code_Snippets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Frequently used, important functions for ML
* Pandas
* Numpy
* Scikit-learn
* Pyplot
* Seaborn


# Pandas

## Set a column as index

```df.set_index(col_name, drop=bool)``` <br>
* col_name = name of column to set as index
* drop = True (the default) removes the column from the regular columns once it becomes the index; False keeps it as a column as well

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
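
A minimal sketch of the effect of `drop`, using a made-up two-column frame (the column names `id` and `score` are placeholders, not from any real dataset):

```python
import pandas as pd

# Hypothetical toy data; "id" and "score" are placeholder column names.
df = pd.DataFrame({"id": [10, 20, 30], "score": [1.5, 2.5, 3.5]})

# drop=True (the default): "id" becomes the index and leaves the columns.
indexed = df.set_index("id", drop=True)

# drop=False: "id" becomes the index but also stays as a regular column.
kept = df.set_index("id", drop=False)

print(list(indexed.columns))
print(list(kept.columns))
```

Note that `set_index` returns a new DataFrame here; `df` itself is left unchanged.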

## Dropping a column in data frame

```df.drop(name, axis, inplace=bool)```
* name = name of a column, a list of column names, OR index labels of rows
* axis = 0 -> rows | 1 -> columns
* inplace = If True, modifies the DataFrame itself and returns None. If False, returns a new DataFrame with the columns (or rows) dropped

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
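
The three arguments can be tried out on a small frame. This is only a sketch with invented column names (`a`, `b`, `c`):

```python
import pandas as pd

# Hypothetical toy data with placeholder column names.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# axis=1 drops a column; without inplace a new frame is returned.
without_b = df.drop("b", axis=1)

# axis=0 drops rows, identified by their index labels.
without_rows = df.drop([0, 2], axis=0)

# inplace=True changes df itself and returns None.
df.drop(["a", "c"], axis=1, inplace=True)

print(list(without_b.columns), list(without_rows.index), list(df.columns))
```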

## Sum of columns in dataframe

``` df.sum() ``` <br>
Returns a pd.Series whose keys are the column names of the dataframe and whose values are the sums of those columns. <br>
*i.e.* it computes the sum of every column. <br>
* **NOTE** If a column holds strings, the sum concatenates all the elements of that column.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
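
A short sketch of the string behaviour mentioned in the note, on invented data. The `numeric_only=True` argument is one way to skip non-numeric columns if concatenation is not what you want:

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one string column.
df = pd.DataFrame({"qty": [1, 2, 3], "tag": ["a", "b", "c"]})

# Sums every column; the string column is concatenated.
totals = df.sum()

# Restrict the sum to numeric columns only.
numeric_totals = df.sum(numeric_only=True)

print(totals)
print(numeric_totals)
```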

## Find total null values in DF

```df.isnull().sum().sort_values(ascending=False)```
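
As an illustration on a made-up frame, the chain counts the nulls per column and lists the worst columns first:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a different number of nulls in each column.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [1, 2, 3],
    "c": [np.nan, 1.0, 2.0],
})

# isnull() gives a boolean frame, sum() counts the True values per column,
# sort_values(ascending=False) puts the columns with most nulls on top.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)
```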


## Correlation Matrix

```
df.corr()
```

## Find non null values:

```df.notnull()```
<br>Returns a boolean DataFrame of the same shape as df (a boolean Series when called on a Series).

## Drop columns with null values greater than a threshold:



```
df.drop(df.loc[: ,df.isnull().sum()> theta].columns, axis=1, inplace=True)
```
**EXPLANATION**
1. _df.isnull().sum() > theta_ : Returns a Series whose keys are the column names and whose values are booleans. Something of the type: <br>
> * _col_1_ True
> * _col_2_ False
> * _col_3_ False
> ...
This Series serves as the column selector for df.loc[ ].
2. _df.loc[rows, columns ]_ : The second argument in df.loc is the boolean pd.Series. df.loc returns a pd.DataFrame with all the rows and only the columns that were True in the pd.Series provided.
3. _df.loc[ ].columns_ : Returns the labels (a pd.Index) of the columns whose total null values are greater than theta.
4. _df.drop()_ : Drops these columns.
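
The four steps above can be followed one at a time on a small frame. This is a sketch with invented column names and an arbitrary `theta = 2`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and theta are placeholders.
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "one_missing": [1.0, np.nan, 3.0, 4.0],
    "complete": [1, 2, 3, 4],
})
theta = 2

# Step 1: boolean Series, True where a column has more than theta nulls.
null_counts = df.isnull().sum()
too_sparse = null_counts > theta

# Steps 2 and 3: select those columns with df.loc and take their labels.
cols_to_drop = df.loc[:, too_sparse].columns

# Step 4: drop them (assigning the result instead of using inplace=True).
df = df.drop(cols_to_drop, axis=1)
print(list(df.columns))
```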


## Brief summary of a series or DF

```df.describe() ``` <br>
or <br>
```series.describe()```

## Get info about datatypes of various columns in dataframe

```
df.info()
```

## Fill null values:

```
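# Assign the result back; fillna(inplace=True) on a selected column may not update df in recent pandas.
df['col_name'] = df['col_name'].fillna(value)
# OR assign only to the rows where the column is null:
df.loc[df['col_name'].isnull(), 'col_name'] = value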
```
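
Both variants can be tried on a small frame. The data and the fill values below (the column mean and the string `"unknown"`) are illustrative choices, not a recommendation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one null in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0],
    "city": ["Pune", None, "Surat"],
})

# Variant 1: fillna and assign back, here using the column mean as the value.
df["age"] = df["age"].fillna(df["age"].mean())

# Variant 2: boolean mask with df.loc, assigning only to the null rows.
df.loc[df["city"].isnull(), "city"] = "unknown"

print(df)
```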

## Datatype of a cell in Dataframe

**First, find the index labels where the values of interest (here, nulls) lie:**
```
df.loc[df['column_name'].isnull(), ['column_name']][:5]
```

```
type(df.loc[index_no, 'column_name'])
```
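
Put together, the two steps might look like this on invented data (the column names are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a null in the second row.
df = pd.DataFrame({"name": ["a", None, "c"], "value": [1.0, np.nan, 3.0]})

# Step 1: index labels of the rows where "value" is null.
null_idx = df.index[df["value"].isnull()]

# Step 2: Python type of one such cell, looked up by label.
cell_type = type(df.loc[null_idx[0], "value"])

print(list(null_idx), cell_type)
```

In a float column the missing cell is a NaN stored as a float, which is why its type is a floating-point type rather than `NoneType`.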

## Know datatype of columns in a dataframe

```
df.dtypes
```

## One hot Encoding:

__NOTE__ : At times, some category values might be absent from our test data but present in our training data (or vice versa). To deal with such a scenario,
refer to this link:
http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example

```
df_encoded = pd.get_dummies(df)
```
* It automatically detects the categorical features, one hot encodes them, and returns a new dataframe in which the original categorical columns are replaced by the encoded ones.

### In order to make selected features one hot encoded, perform the following:

```
cols = ['categorical_col_1', 'categorical_col_2']
df_encoded = df.drop(cols, axis=1).join(pd.get_dummies(df[cols]))
```
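
For the train/test mismatch mentioned in the note, one common approach (not necessarily the one in the linked article) is to encode both sets and then reindex the test columns to match the training columns. A sketch on invented data, where `color` and `size` are placeholder names:

```python
import pandas as pd

# Hypothetical toy data: the test set only contains the category "red".
train = pd.DataFrame({"color": ["red", "green", "blue"], "size": [1, 2, 3]})
test = pd.DataFrame({"color": ["red", "red"], "size": [4, 5]})

# Encode only the selected column in each set.
train_enc = pd.get_dummies(train, columns=["color"])
test_enc = pd.get_dummies(test, columns=["color"])

# Align the test columns with the training columns; dummy columns that
# never occurred in the test data are created and filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

Reindexing also discards any dummy column that exists only in the test data, so the model always sees the column layout it was trained on.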

# Scikit-learn

## Train test split

``` 
from sklearn.model_selection import train_test_split
trainX, testX, trainY, testY = train_test_split(df1[df1.columns.drop('label_col')], df1['label_col'], test_size=0.2)
```
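
A self-contained sketch of the same split on a made-up frame. The column names are placeholders, and `random_state=0` is an added assumption that only makes the split reproducible:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two features and a binary label.
df1 = pd.DataFrame({
    "f1": np.arange(10),
    "f2": np.arange(10) * 2,
    "label_col": np.arange(10) % 2,
})

X = df1.drop(columns="label_col")  # every column except the label
y = df1["label_col"]               # the label column

# test_size=0.2 keeps 20% of the rows for testing.
trainX, testX, trainY, testY = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(trainX.shape, testX.shape)
```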

## Linear Regression Model

```
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(trainX, trainY)
reg.score(trainX, trainY)  # R^2 on the training data; pass testX, testY to score held-out data
```
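
An end-to-end sketch on synthetic data, assuming a noiseless linear target so the fitted coefficients are easy to check. The data, coefficients and `random_state` are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3*x1 - 2*x2 + 1, with no noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
y = 3.0 * X["x1"] - 2.0 * X["x2"] + 1.0

trainX, testX, trainY, testY = train_test_split(
    X, y, test_size=0.2, random_state=0
)

reg = LinearRegression()
reg.fit(trainX, trainY)

# score() returns R^2; the test score is the one that reflects generalization.
train_r2 = reg.score(trainX, trainY)
test_r2 = reg.score(testX, testY)

print(reg.coef_, reg.intercept_, train_r2, test_r2)
```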

# Seaborn

## Change figure size:

```
import matplotlib.pyplot as plt
import seaborn as sns
plt.subplots(figsize=(width, height))
sns.boxplot(x="col_name", y="col_name", data=data)
```

## Change font size 
### (And other such graph properties.)

```
sns.set(context='notebook', style='darkgrid', palette='deep', font='sans-serif', font_scale=1, color_codes=True, rc=None)
```

## Distribution Plot:

```
sns.histplot(df['col_name'], kde=True)  # distplot is deprecated in recent seaborn; histplot/displot replace it
```

## Scatter Plot:

Note: You can also directly pass a vector to 'x' and 'y'. <br>
Docs: https://seaborn.pydata.org/generated/seaborn.scatterplot.html

```
sns.scatterplot(x = 'col_name', y = 'col_name', data = dataFrame, hue = 'col_name')
```

## Box Plot

For categorical features, you can directly pass the feature name as a string, and sns will automatically draw an individual boxplot for each category occurring in that feature.

```
sns.boxplot(x = 'col_name', y = 'col_name', data = dataFrame)
```


## Heat Map

https://seaborn.pydata.org/generated/seaborn.heatmap.html
```
sns.heatmap(cor, annot=True, linewidths = float, linecolor='black')
```
**Note: If you don't want to display the numeric value of each cell, set annot=False.**

### Correlation matrix:
An easy way to visualize correlation is to use a heat map. Use the following snippet to display it.
```
corr_mtx = df.corr()
sns.heatmap(corr_mtx, annot = True)
```

### Display heat map of only those columns with correlation greater than threshold (theta)
```
cor = df.corr()
cols = cor[cor > theta].notnull().sum() > 1
sns.heatmap(cor.loc[cols, cols])
```
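
The column selection can be checked without plotting. In this sketch, on invented data with an arbitrary `theta = 0.9`, a column is kept only if it exceeds the threshold with at least one column other than itself, which is why the count is compared with 1 (every column correlates perfectly with itself):

```python
import pandas as pd

# Hypothetical toy data: "b" is perfectly correlated with "a", "c" is not.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
theta = 0.9  # placeholder threshold

cor = df.corr()

# cor[cor > theta] keeps values above theta and turns the rest into NaN;
# counting the non-null cells per column includes the diagonal (always 1.0).
cols = cor[cor > theta].notnull().sum() > 1

# Keep only the rows and columns that passed the test.
selected = cor.loc[cols, cols]
print(selected)
```

`selected` is what would be passed to `sns.heatmap`.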

### Highlight cells with correlation values greater than threshold

```
sns.heatmap(cor[cor>theta], annot=True, linewidths=1, linecolor='white')
```

![something like this ](https://i.imgur.com/XWiQsem.png)

## Pair Plot
Plots the pairwise relationships between the columns of a dataset. <br> 
**ONE OF THE BEST CHARTS TO SEE RELATIONSHIPS**

```
sns.pairplot(df[col_list])
```