<a href="https://colab.research.google.com/github/jayeshkum/Learn-Data-Science/blob/master/Data_Science_Python_Script_Repository.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Science Repository of Python codes**

This document is a repository of Python code snippets that are useful for any data scientist.

At times we know what to do but may not remember how to do it, and end up searching the internet for the code blocks needed for routine tasks.

Importing required libraries:
Default: Pandas, NumPy, OS, Matplotlib, Seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Render matplotlib output inside the notebook itself, rather than in a separate window
import seaborn as sns
import os

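Basic commands for using the OS library:
os.chdir(r"D:\Pandas") will change the working directory to D:\Pandas. The r prefix makes it a raw string, so the backslash in the Windows path is not treated as an escape character.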


**Reading Data and basic information, description of data**



### Reading CSV File ###
To read a .csv file into a dataframe named 'data', use read_csv. Blank cells in the CSV file are converted to "NaN" (read as Not a Number), and a default integer index (0, 1, 2, ...) is created to label the rows:

*data = pd.read_csv("path to file")*


If the first column of the CSV file already holds row numbers or IDs, use it as the index instead of creating a new one by passing index_col=0:

*data = pd.read_csv("path to file", index_col=0)*

The CSV file may use a placeholder such as "??" or "###" to denote junk/blank data. In that case, ask pandas to read these values as missing by listing them in 'na_values':

*data = pd.read_csv("path to file", index_col=0, na_values=["??", "###"])*
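As a minimal sketch of the options above, the snippet below reads a small in-memory CSV instead of a file on disk. The sample text, the column names and the `csv_text` variable are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; "??" and "###" stand in for junk values.
csv_text = """id,age,city
1,34,Pune
2,??,Delhi
3,29,###
4,,Mumbai
"""

# Use the first column ("id") as the index and treat "??" / "###" as missing.
data = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=["??", "###"])

print(data)
print(data.isnull().sum())  # missing values per column
```

Both the junk markers and the genuinely blank cell end up as NaN, so they can be handled together later.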


### Reading a TXT file ###
To read a text file (.txt) into a dataframe named 'data', use read_table. As with read_csv, blank cells in the text file are converted to "NaN" (read as Not a Number), and a default integer index is created to label the rows:

*data = pd.read_table("path to file")*

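By default read_table splits columns on tabs. If the file uses a different delimiter, all data is read into a single column unless the separator is provided. The separator can be set explicitly with either sep or delimiter: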

*data = pd.read_table("path to file", sep = "\t")*
*data = pd.read_table("path to file", delimiter = "\t")*

The code above assumes the text file has columns separated by tabs (denoted as \t). For any other delimiter, replace \t with the actual delimiter (for example, for space-delimited data, use delimiter = " ").

One can also use read_csv to read a text file by changing its default sep="," to the file's delimiter, since a CSV file is nothing but a text file with "," as the separator.
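The delimiter handling described above can be sketched with in-memory text. The sample strings and variable names here are hypothetical.

```python
import io
import pandas as pd

# Hypothetical tab-separated and semicolon-separated text.
tab_text = "name\tscore\nasha\t81\nravi\t75\n"
semi_text = "name;score\nasha;81\nravi;75\n"

# read_table splits on tabs by default.
tab_data = pd.read_table(io.StringIO(tab_text))

# For any other delimiter, pass it explicitly via sep (or its alias, delimiter).
semi_data = pd.read_table(io.StringIO(semi_text), sep=";")

# read_csv with the same sep gives the same result.
semi_data_csv = pd.read_csv(io.StringIO(semi_text), sep=";")

print(tab_data.shape, semi_data.shape)
print(semi_data.equals(semi_data_csv))
```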

### Reading Excel file ###
At times data may reside in an Excel file with multiple sheets. To read data from one particular sheet of the Excel file, use read_excel:

*data = pd.read_excel("path to file", sheet_name="data sheet")*

### Copy Data Frame ###
There are two kinds of copy:

1. Shallow copy: *data_copy = data.copy(deep=False)* 
This creates a new dataframe object, 'data_copy', that shares its underlying data and index with 'data' rather than duplicating them. In pandas versions without Copy-on-Write enabled, changes made to the values of 'data_copy' are reflected in 'data' as well.

2. Deep copy: *deep_data_copy = data.copy(deep=True)*
A deep copy duplicates the data and index into a new object with no reference to the original. Any changes made to the deep copy have no effect on the original. (Recommendation: use a deep copy or create a new dataframe if the data is small.)
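A small sketch of the deep copy behaviour, using a made-up dataframe. Only the deep copy is demonstrated, because how far a shallow copy propagates changes depends on the pandas version and its Copy-on-Write setting.

```python
import pandas as pd

# Hypothetical sample data.
data = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

deep_data_copy = data.copy(deep=True)

# Modify the deep copy only.
deep_data_copy.loc[0, "a"] = 100

print(data.loc[0, "a"])            # value in the original
print(deep_data_copy.loc[0, "a"])  # value in the deep copy
```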

## Attributes of Data ##
The next step after reading the data from the data source is to understand its basic structure.

In [None]:
# Row labels of dataframe 'data' can be obtained by:
data.index

# Similarly, column labels of 'data' can be obtained by:
data.columns

# Dataframe size (total number of elements, i.e. rows x columns):
data.size

# Shape of data:
data.shape

In [None]:
# Dimension and memory usage can be obtained by:
data.ndim
data.memory_usage()
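To see what these attributes return, here is a minimal sketch on a made-up three-row dataframe (the column names and values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical sample data: 3 rows, 3 columns.
data = pd.DataFrame(
    {"name": ["asha", "ravi", "meena"], "age": [34, 29, 41], "score": [81.5, 75.0, 90.2]}
)

print(data.index)           # row labels
print(data.columns)         # column labels
print(data.shape)           # (number of rows, number of columns)
print(data.size)            # rows * columns
print(data.ndim)            # number of axes (2 for a dataframe)
print(data.memory_usage())  # bytes used by the index and by each column
```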


The Python indexing operators [] and . (attribute access) give quick and easy access to the contents of a pandas dataframe. Methods such as head and tail return the first or last rows.

For example:

*data.head(10)* 

returns the first 10 rows of the dataframe. Change the number '10' to return more or fewer rows. Use *.tail* to get the last rows.


### Indexing and selecting data ###
To access a single scalar value, use 'at' (label based) or 'iat' (integer position based).

*data.at[row_label, 'Column_name']*

*data.iat[4,6]*

To access a group of rows/columns, use '.loc[ ]' (label based) or '.iloc[ ]' (integer position based).

*data.loc[:,"Column_name"]*

*data.iloc[:,5]*
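The difference between label based and position based access is easier to see with non-numeric row labels. The dataframe below is a made-up example.

```python
import pandas as pd

# Hypothetical sample data with string row labels.
data = pd.DataFrame(
    {"city": ["Pune", "Delhi", "Mumbai"], "temp": [31, 38, 33]},
    index=["r1", "r2", "r3"],
)

print(data.at["r2", "temp"])   # scalar by row label and column label
print(data.iat[1, 1])          # the same cell by integer position
print(data.loc[:, "city"])     # all rows of one column, by label
print(data.iloc[:, 1])         # all rows of the second column, by position
print(data.loc["r1":"r2", ["city", "temp"]])  # label slices include the end label
```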


In [None]:
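# Checking data types #
data.dtypes

# Getting counts of the different data types in the dataframe
# (get_dtype_counts() was removed in pandas 1.0; value_counts() on dtypes gives the same information)
data.dtypes.value_counts()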

In [None]:
# Selecting data, based on the data type:
data.select_dtypes(exclude=[object])

# Above code will select part of dataframe, excluding columns with data type of "Object"
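As a hedged sketch of type based selection, the snippet below splits a made-up dataframe into numeric and non-numeric parts. It uses the "number" selector rather than object, which is an alternative to the call above and not a replacement for it.

```python
import pandas as pd

# Hypothetical sample data with one text column and two numeric columns.
data = pd.DataFrame({"name": ["asha", "ravi"], "age": [34, 29], "score": [81.5, 75.0]})

print(data.dtypes)
print(data.dtypes.value_counts())  # how many columns there are of each dtype

numeric_part = data.select_dtypes(include="number")  # integer and float columns
text_part = data.select_dtypes(exclude="number")     # everything else

print(numeric_part.columns.tolist())
print(text_part.columns.tolist())
```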

### Summary / Exploration of Data Frame ###
Understanding the basic structure of the data frame is critical for any analysis. It is followed by cleaning and formatting the data into a readable, usable form.

### Information about dataframe


In [None]:
# info() provides column names, number of non-null observations, and data types, followed by memory usage.
data.info()

# describe() gives count, mean, standard deviation, minimum, quartiles and maximum of numerical columns.
data.describe()

# Mean, Median
data['Col_Name'].mean()
data['Col_Name'].median()

# Value_Counts, gives frequency count of variable
data['Col_Name'].value_counts()


In [None]:
# Finding unique observations in a column:
print(np.unique(data['Col_Name']))
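The summary calls above can be tried on a small made-up dataframe; the column names "grade" and "marks" are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
data = pd.DataFrame({
    "grade": ["A", "B", "A", "C", "A", "B"],
    "marks": [90, 72, 88, 55, 95, 70],
})

print(data["marks"].mean())
print(data["marks"].median())
print(data["grade"].value_counts())  # frequency of each value, most frequent first
print(np.unique(data["grade"]))      # sorted unique values
print(data["grade"].nunique())       # number of distinct values
```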

### Data conversion, reformatting, reshaping etc. ###


In [None]:
# Converting data type of columns, using 'astype'
data['Col_Name'] = data['Col_Name'].astype('object')
data['Col_Name'] = data['Col_Name'].astype('int64')
data['Col_Name'] = data['Col_Name'].astype('float')



In [None]:
# Replacing values in a column using .replace (.where can be used for conditional replacement)
data['Col_Name'] = data['Col_Name'].replace('Old_Data', 'New_data')

In [None]:
# Detecting missing values, lists all columns and missing values:
data.isnull().sum()

In [None]:
# Filling NaN values with mean/median
data['Col_Name'] = data['Col_Name'].fillna(data['Col_Name'].mean())

# fillna returns a new Series, so assign the result back; otherwise the original dataset would not change.
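A minimal sketch of detecting and filling missing values, on a made-up single column. It assigns the filled column back to the dataframe, which is one common way to make the change stick.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value.
data = pd.DataFrame({"height": [150.0, np.nan, 170.0, 160.0]})

print(data.isnull().sum())  # missing values per column, before filling

# The mean ignores NaN, so it is computed from the three known values.
data["height"] = data["height"].fillna(data["height"].mean())

print(data)
```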

## EDA ##


In [None]:
# Frequency table / cross tabulation

pd.crosstab(index=data['Col_Name'], columns='count', dropna=True)

# Useful for categorical variables.

In [None]:
# For two variable cross-tabulation
pd.crosstab(index = data['Col_Name_1'], columns = data['Col_Name_2'], dropna = True)


In [None]:
# Two way table, joint probability:
pd.crosstab(index = data['Col_Name_1'], columns = data['Col_Name_2'], normalize = True, dropna= True)

# For marginal probability, add " margins = True " in the options. 
# For conditional probability, use " normalize = 'index' " rather than True, to get row-wise conditional probability. 
# Using "normalize = 'columns' " would give column wise conditional probability.
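The normalize options described in the comments above can be compared side by side. The sketch below uses a made-up dataframe with two categorical columns.

```python
import pandas as pd

# Hypothetical sample data.
data = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M"],
    "smoker": ["no", "no", "yes", "yes", "no", "no"],
})

# Joint probability: all cells together sum to 1.
joint = pd.crosstab(data["gender"], data["smoker"], normalize=True)

# Row-wise conditional probability: each row sums to 1.
row_cond = pd.crosstab(data["gender"], data["smoker"], normalize="index")

# Column-wise conditional probability: each column sums to 1.
col_cond = pd.crosstab(data["gender"], data["smoker"], normalize="columns")

# Joint probability with marginal totals in an extra "All" row and column.
with_margins = pd.crosstab(data["gender"], data["smoker"], normalize=True, margins=True)

print(joint)
print(row_cond)
print(col_cond)
print(with_margins)
```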


### Plots, Correlation ###

In [None]:
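# Correlation table (numeric_only=True skips non-numeric columns; the parameter needs pandas 1.5 or later):
data.corr(numeric_only=True)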
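Correlation is only defined for numeric columns. One version-independent way to handle mixed dataframes, sketched here on made-up data, is to select the numeric columns first and then call corr.

```python
import pandas as pd

# Hypothetical sample data; "label" is a non-numeric column.
data = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [3, 5, 7, 9],  # y = 2x + 1, so x and y are perfectly correlated
    "label": ["a", "b", "c", "d"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr_table = data.select_dtypes(include="number").corr()
print(corr_table)
```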


In [None]:
# Scatter plot

plt.scatter(data['Col_1'], data['Col_2'], c = 'red')
# Scatter plot of Col_1 vs Col_2, using 'red' color
# Add title, xlabel, ylabel:
plt.title("Title")
plt.xlabel("X-Label")
plt.ylabel("Y-Label")
plt.show() # To show the chart

# Histogram, Frequency distribution (dividing data in 10 bins, using blue color)
plt.hist(data['Col_Name'], bins=10, color = 'blue')

# Bar plot
plt.bar(data['Col_Name_1'], data['Col_Name_2'])


In [None]:
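# Distribution Plot (distplot is deprecated in seaborn 0.11 and later; histplot is its replacement for histograms)
sns.histplot(data['Col'], kde=False) # Set kde=True to overlay a KDE line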

# Regression Plot/ Scatter
sns.set(style = "darkgrid")
sns.regplot(x = data['Col_Name_1'], y = data['Col_Name_2'], fit_reg = True, marker ="*")

# By default, fit_reg is True; change it to False to remove the regression line from the plot.
# Change marker to denote points differently.


In [None]:
# Add hue to a scatter / regression plot

sns.lmplot(x = 'Col_1', y = 'Col_2', data = data, fit_reg = False, hue = 'Col_3', legend = True, palette = "Set1")

# Bar / Count Plot (the data argument must be the dataframe itself, not a string)

sns.countplot(x = 'Col_Name_1', data = data, hue = 'Col_Name_2')


In [None]:
# Box / Whisker plot

sns.boxplot(y = data['Col_Name'])

sns.boxplot(x = data['Col_Name_1'], y = data['Col_Name_2'], hue = data['Col_Name_3'])

# Pairwise plots

sns.pairplot(data, kind = 'scatter', hue = 'Col_Name')


## Tips & Tricks ##

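1. When a function takes an "axis" parameter, axis=0 refers to the row axis (the index) and axis=1 refers to the column axis. For example, data.sum(axis=0) adds down the rows and returns one value per column, while data.drop('Col', axis=1) drops a column.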
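A short sketch of how the axis argument changes the result, using a made-up two-column dataframe.

```python
import pandas as pd

# Hypothetical sample data.
data = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = data.sum(axis=0)     # adds down the rows: one value per column
row_totals = data.sum(axis=1)     # adds across the columns: one value per row
dropped = data.drop("b", axis=1)  # axis=1 tells drop to look for a column label

print(col_totals)
print(row_totals)
print(dropped)
```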


In [None]:
# To get the mode of 'Col' variable:
data['Col'].value_counts().index[0]


In [None]:
# Imputing missing values by mean for dataframe using lambda function

filled_data = data.apply(lambda x: x.fillna(x.mean()) if x.dtype == 'float' else x.fillna(x.value_counts().index[0]))

# This command replaces NA values with the column mean for float columns and the mode for other (i.e. object type) columns
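The imputation above can be tried on a made-up dataframe with one float column and one text column. This sketch swaps the dtype comparison for pd.api.types.is_float_dtype, which is an assumption made here for robustness across pandas versions, not part of the original command.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a missing value in each column.
data = pd.DataFrame({
    "height": [150.0, np.nan, 170.0],
    "city": ["Pune", None, "Pune"],
})

# Float columns get their mean; other columns get their most frequent value (mode).
filled_data = data.apply(
    lambda x: x.fillna(x.mean())
    if pd.api.types.is_float_dtype(x)
    else x.fillna(x.value_counts().index[0])
)

print(filled_data)
```

Note that integer columns never hold NaN in default pandas dtypes, so in practice the else branch mostly handles text and categorical columns.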

In [None]:
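# Web scraping: create a headless browser driver (replace the executable path with your own).
# Note: webdriver.PhantomJS is only available in older Selenium 3.x releases; it is not available in Selenium 4.
from selenium import webdriver
driver = webdriver.PhantomJS(executable_path=r'E:\PATH\phantomjs\bin\phantomjs')
# Packages used for web scraping:
# pip install phantomjs
# pip install webdriver-manager
# pip install urllib3
# pip install bs4
# pip install requests
# pip install selenium
# pip install geckodriver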