A list of commands commonly used in Python data projects, collected as a quick reference.

GitHub repo:

https://github.com/munishkumar-gh/munishkumar-gh.github.io

## Table of Contents
<!--TABLE OF CONTENTS-->

- [Libraries](#Libraries)
- [Functions](#Functions)
- [Globals](#Globals)
- [Read CSV](#Read-CSV)
- [Conclusion](#Conclusion)
- [Write out file](#Write-out-file)
- [Copy data frame](#Copy-data-frame)
- [Create new df, reset index](#Create-new-df,-reset-index)
- [Cast to list](#Cast-to-list)
- [Cast to date-time](#Cast-to-date-time)
- [Check for Element in List](#Check-for-Element-in-List)
- [Concatenate 2 df](#Concatenate-2-df)
- [Data Explore](#Data-Explore)
- [Drop Columns/rows](#Drop-Columns/rows)
- [Diff of 2 data frames with identical cols](#Diff-of-2-data-frames-with-identical-cols)
- [Groupby](#Groupby)
- [Iterate through columns in Dataframe](#Iterate-through-columns-in-Dataframe)
- [% changes in dataframe between rows](#%-changes-in-dataframe-between-rows)
- [Set as Index](#Set-as-Index)
- [Split Dataframe to arrays](#Split-Dataframe-to-arrays)
- [Visualization](#Visualization)
      - [Pairplot for EDA](#Pairplot-for-EDA)
      - [Single plot Heatmap](#Single-plot-Heatmap)
      - [Subplot with Barplot & Violinplot](#Subplot-with-Barplot-&-Violinplot)
      - [Histogram](#Histogram)
      - [Scatter plot with X-Y line](#Scatter-plot-with-X-Y-line)
      - [Residual Plot](#Residual-Plot)
- [Normalize (Math)](#Normalize-(Math))
- [One hot encoding](#One-hot-encoding)
- [Out of Sample Set](#Out-of-Sample-Set)
- [Feature Selection](#Feature-Selection)
- [sklearn - Train Test Split](#sklearn---Train-Test-Split)
- [sklearn - Normalize Features](#sklearn---Normalize-Features)
- [Confusion Matrix](#Confusion-Matrix)
- [Example of supervised: Logistic](#Example-of-supervised:-Logistic)
- [Example of Unsupervised: K-Means](#Example-of-Unsupervised:-K-Means)
- [Others](#Others)

# Libraries

In [None]:
# General Libraries
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
import time
import re
import requests
import pickle
import seaborn as sns
import os
import glob
import sys
sns.set()

import datetime
from datetime import timedelta, date 
start = time.time()
%matplotlib inline

# Forces the print statement to show everything and not truncate
# np.set_printoptions(threshold=sys.maxsize) 
print('Libraries imported')

In [None]:
# Only uncomment if none of the following is installed

#!conda install -c anaconda scikit-learn --yes
#!conda install -c anaconda scipy --yes
#!conda install -c anaconda dash --yes
#!conda install -c anaconda plotly --yes
#!conda install -c anaconda multiprocess --yes
#!conda install -c anaconda nltk --yes
#!conda install -c https://conda.anaconda.org/conda-forge wordcloud --yes
#!conda install -c conda-forge textblob --yes
#!conda install -c districtdatalabs yellowbrick --yes

# Advanced Libraries 
# Sklearn & NLP libraries
from sklearn import preprocessing
import nltk
from nltk import FreqDist
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer # Preprocessing - Stemming
from textblob import TextBlob #spelling corrections
from textblob import Word # Preprocessing - Lematization 
from wordcloud import WordCloud, STOPWORDS
from yellowbrick.text import TSNEVisualizer

# Word Embedding + ML libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer  #Word embedding
from sklearn import model_selection, linear_model, naive_bayes, ensemble, metrics, preprocessing  # different ML
from xgboost import XGBClassifier #ML algo

# Functions

In [None]:
def acb():
    return

In [None]:
# To color a cell background

from IPython.display import HTML, display
from IPython.display import Image

# To use, just type set_background('XXX') where XXX is whatever colour you want
def set_background(color):         
    script = ("var cell = this.closest('.code_cell');" 
              "var editor = cell.querySelector('.input_area');"         
              "editor.style.background='{}';"         
              "this.parentNode.removeChild(this)"     
             ).format(color)      
    display(HTML('<img src onerror="{}">'.format(script)))

# Globals

In [None]:
dir_name = 'C:/a/b/c/'
filename_suffix = 'csv'

# Read CSV

In [None]:
# thousands=',' treats ',' as the thousands separator when parsing numbers
df = pd.read_csv('XYZ.csv', thousands=',')
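
A minimal sketch of what `thousands=','` changes, using an in-memory CSV instead of a file. The column names and values are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text; the quotes stop the thousands comma being read as a delimiter
csv_text = 'item,amount\nA,"1,234"\nB,"56,789"\n'

df_raw = pd.read_csv(io.StringIO(csv_text))                  # amount stays as text
df_num = pd.read_csv(io.StringIO(csv_text), thousands=',')   # amount parsed as a number

print(df_raw['amount'].tolist())
print(df_num['amount'].tolist())
```

Without the option, the column has to be cleaned by hand before any arithmetic.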

# Conclusion

In [None]:
count = 'Completed Process'
elapsed = (time.time() - start)
print ("%s in %s seconds" % (count,elapsed))

# Write out file

In [None]:
base_filename = 'Clean_Data_raw'
csvs_sht = os.path.join(dir_name, base_filename + "." + filename_suffix)
df.to_csv(csvs_sht, index = True, header = True)
print ("Final File Extract Produced:", base_filename + "." + filename_suffix)

------ 

# Copy data frame

In [None]:
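df2 = df.copy()  # Independent copy; changes to df2 do not affect df
df2 = df[df.isnull().any(axis=1)]  # Alternative: keep only rows with at least one missing value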

# Create new df, reset index

In [None]:
df2 = df[df['a']!='Not assigned']
df2 = df2.rename_axis('b').reset_index()

df = raw.filter(regex='ab')

# Cast to list

In [None]:
dfa = df['XYZ'].tolist()

# Cast to date-time

In [None]:
df['Mth'] = pd.to_datetime(df['Mth'])

# Check for Element in List

In [None]:
lst = {"a", "b", "c", "d"}
df[df['XYZ'].isin(lst)]

# Concatenate 2 df

In [None]:
combine = pd.concat([df2, df1], ignore_index=True)  # ignore_index already renumbers the rows from 0

# Data Explore 

In [None]:
y[0:5]
df.head()
df.tail(5)
df.info()
df.describe(include='all')
df.columns.values # Column Names
df.shape

df['a'].mean()
df['a'].unique()

df.count()
df.sum()
df.isnull()
df.values
df.any()

df.sort_values(['a'], inplace = True)
df2 = df.sort_values(by='a',axis = 0, ascending=True)

col={'a': 'b', 'c':'d'}
df.rename(columns=col, inplace=True)

In [None]:
## print('The dataframe has {} uniques.'.format(len(df['a'].unique())))
## print("Delta:", raw.shape[0]-proc.shape[0])
    
## Missing Values - df.isnull().values.any()
## Total Data Points - df.count().sum()
## Total number of missing values - df.isnull().sum().sum()
## Outlier - (df_pre_proc == 0).astype(int).sum(axis=0)

Refer to
http://localhost:8888/notebooks/Anaconda3/%40Completed/Weather/3_Pre_Processing_Gov_Data_Part_2.ipynb
to infill missing or zero-valued data using draws from a normal distribution
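
The referenced notebook is not reproduced here. As a rough sketch of the idea only, the hypothetical helper below (`fill_missing_normal` is not from the original) replaces missing values, and optionally zeros, with random draws from a normal distribution whose mean and standard deviation are taken from the observed values.

```python
import numpy as np
import pandas as pd

def fill_missing_normal(series, treat_zero_as_missing=False, seed=0):
    """Fill gaps with draws from N(mean, std) of the observed values (illustrative helper)."""
    s = series.astype(float).copy()
    if treat_zero_as_missing:
        s[s == 0] = np.nan                      # treat zeros as missing readings
    missing = s.isna()
    observed = s[~missing]
    rng = np.random.default_rng(seed)           # seeded so the infill is repeatable
    s[missing] = rng.normal(observed.mean(), observed.std(), size=int(missing.sum()))
    return s

# Made-up temperature readings with two gaps and one zero
temps = pd.Series([21.0, np.nan, 23.5, 0.0, 22.1, np.nan])
filled = fill_missing_normal(temps, treat_zero_as_missing=True)
print(filled)
```

Observed values are left untouched; only the gaps receive simulated values.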


# Drop Columns/rows

In [None]:
df2 = df.dropna(axis = 0) #rows, axis = 1 for col

cols = ['a', 'b', 'c']
df = df.drop(cols, axis = 1)

df = df.loc[:, ~df.columns.str.contains('XyZ')]

# Diff of 2 data frames with identical cols

In [None]:
df_new = df1.sub(df2.squeeze())

# Groupby

In [None]:
df.groupby(df['a']).sum()
df.groupby(df['a']).count()

df.groupby('a')['b'].transform('sum')
df.groupby('a')['b'].transform('count')>5
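
A small worked example of the `transform` pattern above, on made-up data. `transform` returns one value per original row, so the result can be stored as a column or used as a row filter. The threshold here is 2 rather than 5 only to keep the data small.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'y', 'z'],
                        'b': [1, 2, 3, 10, 20, 100]})

# Group total repeated on every row of the group
df_demo['b_sum'] = df_demo.groupby('a')['b'].transform('sum')

# Boolean mask: keep only rows whose group has at least 2 members
mask = df_demo.groupby('a')['b'].transform('count') >= 2
df_big = df_demo[mask]

print(df_demo)
print(df_big)
```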

# Iterate through columns in Dataframe

In [None]:
import statsmodels.tsa.stattools as sts  # provides adfuller (Augmented Dickey-Fuller test)

adf_result = {}
for col in df_preproc.columns:
    adf_result[col] = sts.adfuller(df_preproc[col])

# % changes in dataframe between rows

In [None]:
# Forward-fill gaps first; fill_method='ffill' is deprecated from pandas 2.1
df = df.ffill().pct_change(fill_method=None)
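
To check what `pct_change` returns, here is a sketch on three made-up prices. Each row is `(current - previous) / previous`, and the first row has no predecessor.

```python
import pandas as pd

prices = pd.DataFrame({'p': [100.0, 110.0, 99.0]})
change = prices.pct_change()

# Row 0: NaN (nothing to compare against)
# Row 1: (110 - 100) / 100
# Row 2: (99 - 110) / 110
print(change)
```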

# Set as Index

In [None]:
#Using Multiindex
df.index = [f'{y}-{m}' for y, m in df.index] #Assume Date-time
df = df.rename_axis('month').reset_index()

df.set_index("A", inplace = True)

# Change Frequency only for time series
df  = df.asfreq('MS')
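
A sketch of what `asfreq('MS')` does to a monthly series with a gap, using made-up dates. The index is rebuilt at month-start frequency and any month that was absent appears as a row of NaN.

```python
import pandas as pd

idx = pd.to_datetime(['2020-01-01', '2020-02-01', '2020-04-01'])  # March is missing
ts = pd.DataFrame({'v': [1.0, 2.0, 4.0]}, index=idx)

ts_ms = ts.asfreq('MS')  # 'MS' = month start
print(ts_ms)
```

The inserted row can then be filled, for example with `ffill()` or interpolation, depending on the data.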

# Split Dataframe to arrays

In [None]:
df_a, df_b = np.array_split(data, 2, axis=1) # Split along col into 2 array
df = df_a.melt(var_name='groups', value_name='vals') # repeat for df_b
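
A worked example of the split-then-melt step on made-up data. This sketch slices columns with `iloc` rather than `np.array_split`, which is a substitution for illustration and not the method used above. The `melt` call is the same.

```python
import pandas as pd

data = pd.DataFrame({'w': [1, 2], 'x': [3, 4], 'y': [5, 6], 'z': [7, 8]})

half = data.shape[1] // 2
df_a = data.iloc[:, :half]   # first half of the columns
df_b = data.iloc[:, half:]   # remaining columns

# Wide to long: one row per (column name, value) pair
long_a = df_a.melt(var_name='groups', value_name='vals')
print(long_a)
```

The long format is what the `x="groups", y="vals"` plotting calls in the Visualization section expect.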

# Visualization

#### Pairplot for EDA

In [None]:
sns.set_palette("bright")
sns.pairplot(df[['a', 'b', 'c', 'd']])

#### Single plot Heatmap

In [None]:
plt.figure(figsize=(10, 10))
ax = sns.heatmap(df.corr(), annot=True, fmt=".2f")

#### Subplot with Barplot & Violinplot

In [None]:
height = 10
width = 10
nrows = 3  # assumed: the label loop below handles up to three panels
ncols = 1
fig, axes = plt.subplots(nrows=nrows, ncols=ncols)
fig.set_figwidth(width)
fig.set_figheight(height)
sns.barplot(x="groups", y="vals", data=df, ax=axes[0]) #Refer to 'Split Dataframe to arrays'

sns.violinplot(x="groups", y="vals", data=df, ax=axes[1])

#Loop labels
for i in range(nrows):
    ax = axes[i]
    if i == 0:
        ax.set_title('Violin Plots - Delta Diff')
        ax.set_ylabel('Frequency')
        continue
    if i == 1:
        ax.set_ylabel('Frequency')
        continue
    if i == 2:
        ax.set_xlabel('Categories')
        ax.set_ylabel('Frequency')

#### Histogram

In [None]:
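fig, axs = plt.subplots(nrows=2)  # axs was not defined; two stacked panels assumed
sns.histplot(df['a'], kde=True, ax=axs[0])  # distplot is deprecated since seaborn 0.11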

# Line
sns.lineplot(x=df['b'], y=df['a'], color='black', ax=axs[1])
plt.title('XXX')
plt.xlabel('YYY')
plt.ylabel('ZZZ')

#### Scatter plot with X-Y line

In [None]:
RedFunction = y_test # Test Data Set
BlueFunction = yhat_MLR # Predicted Data Set

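# color3 is assumed to hold a matplotlib colour defined earlier
plt.scatter(RedFunction, BlueFunction, label="Test vs Predicted", s=25, c = color3, zorder=10)
ax = plt.gca()  # current axes, needed for the axis limits below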
xymax = round(max(RedFunction.max()+5, BlueFunction.max()+5), -1)
lims = [
    np.min([ax.get_xlim(), ax.get_ylim()]),  # min of both axes
    np.max([ax.get_xlim(), ax.get_ylim()]),  # max of both axes
]
plt.plot(lims, lims, 'k-', alpha=0.75, zorder=0)
#plt.axes().set_aspect('equal')
plt.xlim(0, xymax)
plt.ylim(0, xymax)
plt.legend(loc='upper left')

plt.show()
plt.close()

#### Residual Plot

In [None]:
cmap = sns.cubehelix_palette(dark=.9, light=.1, as_cmap=True)
# x and y must be passed by keyword in seaborn 0.12+; lowess=True needs statsmodels
sns.residplot(x=RedFunction, y=BlueFunction, ax=ax, label = "Residual",
                  scatter_kws={"cmap": cmap}, lowess=True, color='r')

------ 

# Normalize (Math)

In [None]:
norm = (df-df.min())/(df.max()-df.min())
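
A quick check of the min-max formula on made-up numbers. Each column is rescaled independently so that its minimum maps to 0 and its maximum to 1.

```python
import pandas as pd

df_demo = pd.DataFrame({'a': [0, 5, 10], 'b': [2, 4, 6]})
norm = (df_demo - df_demo.min()) / (df_demo.max() - df_demo.min())
print(norm)
```

A constant column gives 0/0 and therefore NaN, so such columns are best dropped or handled before normalizing.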

# One hot encoding

In [None]:
df_onehot = pd.get_dummies(df[['XYZ']], prefix="", prefix_sep="")
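
A sketch of the one-hot call on a made-up colour column. With `prefix=""` and `prefix_sep=""` the new columns are named after the category values alone.

```python
import pandas as pd

df_demo = pd.DataFrame({'XYZ': ['red', 'blue', 'red']})
df_onehot = pd.get_dummies(df_demo[['XYZ']], prefix="", prefix_sep="")

# One indicator column per category, in sorted order
print(df_onehot)
```

Depending on the pandas version the indicators are booleans or 0/1 integers; `astype(int)` gives 0/1 in either case.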

# Out of Sample Set

In [None]:
# Creates a boolean mask; rows where it is True go into the training/test set
# The seed is fixed so that the random split is reproducible
np.random.seed(0)
msk = np.random.rand(len(df)) < 0.8

raw_train_test_set = df[msk]
raw_validate_set = df[~msk]
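
The same hold-out idea on a made-up frame, as a sketch. It uses NumPy's `default_rng` generator instead of the global seed, which is a substitution and not the call used above. The mask sends roughly 80% of rows one way and the rest the other, with no overlap.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'v': range(1000)})

rng = np.random.default_rng(0)           # seeded generator, so the split is repeatable
msk = rng.random(len(df_demo)) < 0.8     # True for roughly 80% of rows

train_test = df_demo[msk]
validate = df_demo[~msk]
print(len(train_test), len(validate))
```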

------ 

# Feature Selection

In [None]:
Feature = df[['a', 'b',]]
x=Feature

y = df['c'].values

# sklearn - Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

random_state = 42
test_size = 0.2

x_train, x_test, y_train, y_test  = train_test_split(
            x, y, test_size = test_size, random_state = random_state
)

print('Train Set: ', x_train.shape, y_train.shape)
print(x_train['a'][0:5])
print('Test Set: ', x_test.shape, y_test.shape)
print(x_test['a'][0:5])

# sklearn - Normalize Features

In [None]:
# Only on Feature set
from sklearn.preprocessing import StandardScaler

# Fit on the training set only, then apply the same scaling to the test set
scaler = StandardScaler().fit(x_train)
X_train = scaler.transform(x_train)
X_test = scaler.transform(x_test)
print('Normalized X Training Set: ', X_train[0:5])
print('Normalized X Testing Set: ', X_test[0:5])
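
To show what standardization computes, here is a NumPy-only sketch on made-up arrays. The mean and standard deviation come from the training rows alone and are then reused for the test rows, so no information from the test set leaks into the scaling.

```python
import numpy as np

x_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_test_demo = np.array([[2.0, 40.0]])

mu = x_train_demo.mean(axis=0)
sigma = x_train_demo.std(axis=0)   # population std (ddof=0), the convention StandardScaler uses

X_train_demo = (x_train_demo - mu) / sigma
X_test_demo = (x_test_demo - mu) / sigma   # scaled with the training statistics

print(X_train_demo)
print(X_test_demo)
```

A test value outside the training range, like 40 here, simply lands further from zero.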

------ 

# Confusion Matrix

In [None]:
#The accuracy score with a logistic regression is ~0.7.
#A confusion matrix may be more valuable as it allows me to visualize the algorithm performance:
#
#TN / True Negative: when a case was negative and predicted negative
#TP / True Positive: when a case was positive and predicted positive
#FN / False Negative: when a case was positive but predicted negative
#FP / False Positive: when a case was negative but predicted positive
#Precision = TP/(TP + FP)
#Recall = TP/(TP + FN)
#F1 Score = 2*(Recall * Precision) / (Recall + Precision)

def plot_conf_mat(cnf_matrix, classes, normalize, cmap, width, height):
    plt.figure(figsize=(width, height))
    if normalize:
        # np.newaxis - make it as column vector by inserting an axis 
        # along second dimension
        cnf_matrix = cnf_matrix.astype('float')/ cnf_matrix.sum(
            axis=1)[:,np.newaxis]
        print("Normalized Confusion Matrix")
    else:
        print("Confusion Matrix, non-normalized")
    
    #imshow() - creates image from 2D numpy array.
    plt.imshow(cnf_matrix, interpolation = 'nearest', cmap=cmap)
    plt.title('Confusion Matrix')
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks (tick_marks, classes, rotation=45)
    plt.yticks (tick_marks, classes)
    
    fmt = '.2f' if normalize else 'd'
    thres = cnf_matrix.max()/2
    for i, j in itertools.product(range(cnf_matrix.shape[0]), range(cnf_matrix.shape[1])):
        plt.text(j, i, format(cnf_matrix[i, j], fmt),
                 horizontalalignment='center',
                 fontsize=20,
                 color = 'yellow' if cnf_matrix[i, j] > thres else 'white'
                )
    # plt.tight_layout()
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.grid(False)  # turn the grid off explicitly; None only toggles it
    return

In [None]:
# Set normalize to true for the confusion matrix to print normalized values
normalize = False
cmap = plt.cm.seismic
width=10
height=width/2

plot_conf_mat(conf_mat_LR, 
              classes=['Wins (0)', 
                       'Losses (1)'],
              normalize=normalize, cmap=cmap, width=width, height=height)
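
The precision, recall and F1 formulas from the comments above, worked through on a made-up 2x2 matrix. The layout assumed is scikit-learn's: rows are true classes, columns are predicted classes, and class 0 is the negative class.

```python
import numpy as np

# Hypothetical confusion matrix: [[TN, FP], [FN, TP]]
cnf = np.array([[50, 10],
                [5, 35]])
tn, fp, fn, tp = cnf.ravel()

precision = tp / (tp + fp)   # 35 / 45
recall = tp / (tp + fn)      # 35 / 40
f1 = 2 * (recall * precision) / (recall + precision)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```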

------ 

# Example of supervised: Logistic

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score 

C = 0.0001

LR = LogisticRegression(C = C, solver = 'liblinear')
LR.fit(X_train, y_train)

yhat_LR = LR.predict(X_test)

# accuracy_score(y_true, y_pred)
mean_acc_LR = accuracy_score(y_test, yhat_LR)
conf_mat_LR = confusion_matrix(y_test, yhat_LR)

print('Logistic Regression')
print('==============================================\n')
print("True values:", y_test[0:5].round(1))
print("Pred values:", yhat_LR[0:5].round(1))
print('\n')
print('Mean Accuracy:', mean_acc_LR)
print('\n')
print('F1 Score:\n',classification_report(y_test, yhat_LR))

# Example of Unsupervised: K-Means

In [None]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

kclusters = 3
# Run K-means
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df)

#Check cluster label
kmeans.labels_[:10]

# add cluster label
df2["Cluster Label"] = kmeans.labels_

# Combine to Original df
df2 = cc2.join(df.set_index("XYZ"), on="XYZ")

#sort by cluster labels
df2.sort_values(['Cluster Label'], inplace = True)
df2.dropna(axis=0, inplace=True)

# Others

Codes - LoL_Diamond_Ranked_Games
1. Logistic Regression
2. Decision Tree
3. Support Vector Machine
4. Random Forest
5. Neural Net via Tensorflow

http://localhost:8888/notebooks/Anaconda3/%40Completed/LoL_Diamond_Ranked_Games/LoL_Diamond_Ranked_Games.ipynb



Codes - 4_Supervised_Classification
1. Multilinear Regression
2. Polynomial Regression
3. Ridge Regression
4. K Nearest Neighbours
5. Decision Tree
6. Support Vector Machine

http://localhost:8888/notebooks/Anaconda3/%40Completed/Weather/4_Supervised_Classification.ipynb