# Template [Notebook](https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/) for DS Project
Short description of the project

## Step 1: Understand the problem and define the project
### Understand the problem
* Read about the field
* Talk to experts
* Recap problem statement

### Define the project
* Recap stakeholders
* Set timeline
* Define KPIs

### Load necessary modules and packages

In [None]:
# import usual libraries
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import cufflinks as cf
%matplotlib inline

# REST request libraries
import requests
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated since pandas 1.0
import json

# SQL database packages
import pandabase
import psycopg2
import sqlalchemy

# profiling (newer releases are published as ydata-profiling: `import ydata_profiling as pp`)
import pandas_profiling as pp

# dates and times and time zones and timestamps
import datetime as dt
import time
import pytz

# create UUIDs
import uuid

# set aesthetic parameters
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

### Helpful cheat-sheets
- Pandas: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- SQL: https://cdn.sqltutorial.org/wp-content/uploads/2016/04/SQL-cheat-sheet.pdf
- Scikit-Learn: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

## Step 2: Data acquisition
### Data sources
Describe the different data sources

**Possible sources:**
* Internal databases
* Available APIs from used services
* Publicly available data sets
* Requests to capture certain data

In [None]:
# Read local files
train = pd.read_csv('train.csv')


In [None]:
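# API code snippets
# define the URL for the request (requests needs the scheme, e.g. https://)
url = 'https://www.api.com'
# create a dictionary of headers containing our Authorization header
# (replace the token with your own and keep real credentials out of the notebook)
headers = {"Authorization": "token 1f36137fbbe1602f779300dad26e4c1b7fbab631"}
# define necessary parameters
parameters = {"lat": 37.78, "lon": -122.41}
# make a GET request; the timeout keeps the cell from hanging indefinitely
response = requests.get(url, headers=headers, params=parameters, timeout=10)
# raise an exception on 4xx/5xx responses instead of parsing an error body
response.raise_for_status()
# get JSON data from the response
json_data = response.json()
# flatten it into a data frame
data_df = pd.json_normalize(json_data)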
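
The request above needs a reachable endpoint, so here is a minimal offline sketch of the flattening step on its own. The payload is made up for illustration and only assumes that the API returns nested JSON objects.

```python
import pandas as pd

# Hypothetical payload, shaped like a typical nested API response
json_data = [
    {"id": 1, "location": {"lat": 37.78, "lon": -122.41}, "tags": ["a", "b"]},
    {"id": 2, "location": {"lat": 40.71, "lon": -74.01}, "tags": []},
]

# Nested dictionaries become dotted column names; lists are kept as they are
data_df = pd.json_normalize(json_data)
print(data_df.columns.tolist())
```

Lists such as `tags` stay in a single column; `json_normalize` also accepts a `record_path` argument for cases where each list element should become its own row.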

In [None]:
# SQL code snippets
# Connect to Postgres DB
try:
    conn = psycopg2.connect("dbname='template1' user='dbuser' host='localhost' password='dbpass'")
except psycopg2.OperationalError:
    print("I am unable to connect to the database")
    raise  # stop here: the code below needs a live connection
# Define a cursor to work with
cur = conn.cursor()
# Run a query through the cursor
cur.execute("""SQL query here""")
# Store the fetched data
rows = cur.fetchall()
# Close the cursor and the connection to the DB
cur.close()
conn.close()
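
Query results usually end up in a data frame anyway, so the following sketch loads them directly with `pd.read_sql_query`. It uses an in-memory SQLite database as a stand-in for Postgres so that it runs without a server; the table name and values are invented for the example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the Postgres instance
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (city TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("Berlin", 21.5), ("Berlin", 19.0), ("Paris", 24.0)],
)
conn.commit()

# Pass values as parameters rather than formatting them into the SQL string
query = "SELECT city, temperature FROM measurements WHERE city = ?"
sql_df = pd.read_sql_query(query, conn, params=("Berlin",))
conn.close()
print(sql_df)
```

SQLite uses `?` as its placeholder, while psycopg2 expects `%s`, so the query string needs adjusting when pointing this at Postgres.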

## Step 3: Exploratory data analysis - clean and understand data
- Inspect your data sets and figure out how you can combine them
- Identify outliers, missing values, or human error
- Ask domain specialists questions until you understand all variables and relationships
- Extract important variables and leave behind useless variables
- Form first hypotheses
- Clean your data: make it homogeneous, handle missing data, remove duplicate rows or columns, and reclassify discrete variables whose values are similar
- Handle privacy-sensitive data (tag it and make sure you're compliant)
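
The cleaning steps listed above can be sketched on a toy data set. The column names, the mapping, and the choice of median imputation are assumptions for illustration; the right treatment of missing values depends on the project.

```python
import pandas as pd

# Toy data with inconsistent labels, a duplicate row, and a missing value
df_raw = pd.DataFrame({
    "customer": ["a", "b", "b", "c", "d"],
    "country": ["USA", "U.S.", "U.S.", "Germany", "DE"],
    "age": [34, 51, 51, None, 29],
})

# Reclassify: map different spellings to one label; unmapped values stay as they are
country_map = {"U.S.": "USA", "DE": "Germany"}
df_clean = df_raw.assign(country=df_raw["country"].replace(country_map))

# Remove exact duplicate rows, then fill missing ages with the median age
df_clean = df_clean.drop_duplicates()
df_clean["age"] = df_clean["age"].fillna(df_clean["age"].median())
print(df_clean)
```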

In [None]:
# helpful methods for first insights
df.shape # show no. of rows and columns (an attribute, not a method)
df.info() # show column names, dtypes, and non-null counts
df.head() # show the first rows; df.sample(5) and df.tail() show random and last rows
df.columns # show column names of the data set
df.nunique(axis=0) # show no. of unique values per column
# summarize count, mean, standard deviation, min, quartiles, and max for numeric variables
# (the nested apply formats the numbers as fixed-point for easier reading)
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

# build a ProfileReport and save it as output.html
profile = pp.ProfileReport(df)
profile.to_file("output.html")


In [None]:
# helpful methods for cleaning data:
# (`column`, `columns`, `value`, `value_list`, `low`, and `high` are placeholders)

# reclassify: replace every entry found in value_list with a single value
df[column] = df[column].apply(lambda v: value if v in value_list else v)

# drop (duplicated) columns:
df = df.drop(columns=columns)

# drop columns whose share of NA values reaches the threshold:
def na_filter(frame, threshold=.4):
    na_share = frame.isna().mean()  # fraction of missing values per column
    return na_share[na_share < threshold].index.tolist()
df_cleaned = df[na_filter(df)]
df_cleaned.columns

# remove outliers: keep rows inside a range (or compare with >, <, ==)
df = df[df[column].between(low, high)]
# remove rows with Null values
df = df.dropna(axis=0)
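
The outlier filter above needs concrete bounds. One common way to choose them is Tukey's rule, which keeps values within 1.5 times the interquartile range of the quartiles. The sketch below applies it to made-up prices; whether this rule suits a given variable is a judgment call.

```python
import pandas as pd

# Toy data with one obvious outlier (values are made up)
df = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 12.5, 250.0]})

# Tukey's rule: keep values within 1.5 * IQR of the first and third quartile
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

df_no_outliers = df[df["price"].between(low, high)]
print(df_no_outliers)
```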

In [None]:
# Helpful methods on finding relationships between attributes:

# print correlation heatmap (numeric_only requires pandas >= 1.5 and skips non-numeric columns):
corr = df.corr(numeric_only=True)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True)

# scatterplot to display relationship of two variables:
df.plot(kind='scatter', x=col1, y=col2)

# combine histogram per attribute and scatterplot for all relationships:
sns.pairplot(df)

# explore a single variable:
df[col].plot(kind='hist', bins=123) # histogram
df.boxplot(column=col) # boxplot

## Step 4: Enrich data set with additional data
- Get the most value out of the data set by combining data sources and cleaning time-based attributes
- Analyze relationships between the variables
- Try not to reinforce bias
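
As a minimal sketch of enrichment, the example below derives calendar attributes from a timestamp column and joins a second table. Both tables and all column names are hypothetical.

```python
import pandas as pd

# Hypothetical core data set and a lookup table to enrich it with
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "store_id": [10, 20, 30],
    "ordered_at": ["2021-03-01 09:15", "2021-03-06 18:40", "2021-03-07 12:05"],
})
stores = pd.DataFrame({"store_id": [10, 20], "region": ["north", "south"]})

# Parse the timestamps and derive calendar attributes from them
orders["ordered_at"] = pd.to_datetime(orders["ordered_at"])
orders["weekday"] = orders["ordered_at"].dt.day_name()
orders["is_weekend"] = orders["ordered_at"].dt.dayofweek >= 5

# A left join keeps every order; the indicator column shows which rows found a match
enriched = orders.merge(stores, on="store_id", how="left", indicator=True)
print(enriched)
```

Checking the `_merge` column after a join is a cheap way to spot rows that silently failed to match, which would otherwise show up later as unexplained missing values.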

## Step 5: Build helpful visualizations for communication
- Visualization is one of the most effective ways to explore and communicate your findings
- A good chart conveys a lot of information in a short time
- Make the visualizations interactive and intuitive

## Step 6: Get predictive - machine learning
- Machine learning algorithms can take you a step further, from describing the data to predicting future trends
- Unsupervised clustering algorithms can build models to uncover trends in the data that were not distinguishable in graphs and stats
- Supervised algorithms can predict future trends
- Once a model is built, operationalize it: it should not sit unused on a shelf
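
The difference between the two families can be sketched with scikit-learn on synthetic data. The data set, the choice of `KMeans` and `LogisticRegression`, and all parameters are assumptions for illustration, not a recommendation for a particular project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: three well-separated groups of points
X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# Unsupervised: group the points without looking at the labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Supervised: learn from labelled examples and evaluate on held-out data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Evaluating on data the model has not seen during training is what makes the accuracy figure meaningful; a score computed on the training data would overstate performance.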


## Step 7: Iterate and maintain
- Demonstrate the project's effectiveness as early as possible to justify continued investment
- Maintain the model, because the input data and the environment can change over time
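
One way to notice that the input has changed is to compare the distribution of a feature at training time with its distribution in recent data. The sketch below uses the population stability index (PSI) for that comparison; the function name, the bin count, and the simulated data are assumptions made for the example.

```python
import numpy as np


def population_stability_index(expected, actual, bins=10):
    """Compare two samples of one feature; larger values mean a bigger shift."""
    # Bin edges come from the reference (training-time) sample
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip new values into the reference range so none fall outside the bins
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # A small floor avoids log(0) and division by zero for empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Simulated feature values: one sample from the same distribution, one shifted
rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)
same_feature = rng.normal(0.0, 1.0, 5000)
shifted_feature = rng.normal(1.0, 1.0, 5000)

psi_same = population_stability_index(train_feature, same_feature)
psi_shifted = population_stability_index(train_feature, shifted_feature)
print(f"PSI unchanged: {psi_same:.3f}, PSI shifted: {psi_shifted:.3f}")
```

A frequently quoted rule of thumb treats values below 0.1 as stable and values above 0.25 as a shift worth investigating, but these cut-offs are conventions rather than guarantees.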