# Data Science Projects 

This document helps you structure your data science project. Following the flow of this guide will help you build a data science project quickly while paying attention to some of the key elements of problem solving.

When you are working on a data science project, these are some of the key parts that you need to work on: 

- Problem Statement
- Data Acquisition
- Data Dictionary
- Feature extraction
- Data Cleaning  
- EDA and Data Visualization
- Deriving Key insights from EDA
- Model building
- Evaluation 
- Deriving Key Insights from model
- Exporting the model


## Step 1: Problem statement

You need to clearly define the problem that you are solving. Typically, when you are working with simple datasets such as those from the UCI Machine Learning Repository, your problem statement is decided by your dataset. However, the full process of problem definition involves translating business goals into analysis goals.

Ideally, you should choose a specific domain, choose a problem to solve, and then find data to solve it. In most cases the exact data may not be available; you can then look at various data sources and combine them to get the required dataset. In practice, however, you may have to decide which problem to solve based on what data is available and accessible to you. In that case you go in the opposite direction: decide on a dataset, decide what questions you can ask of it, and then attack the problem.

Either way, keep in mind that you will need to learn about different domains in the process.

**--provide examples here--**

- [ ] Clearly state your data source.
- [ ] What type of data do you have? 
    - Structured/Unstructured.
- [ ] What are you predicting?
- [ ] What are your features? 
- [ ] What is your target? 
- [ ] What type of problem is it? 
    - [ ] Supervised/Unsupervised? 
    - [ ] Classification/Regression? 
- [ ] If you have to combine features to define the target, discuss that here.
- [ ] Do you need to combine multiple data sources? 

## Step 2: Data Acquisition

In this section you will talk about how you found the data. 

- [ ] Make sure that you mention the source of your data.
- [ ] If you are using multiple datasets, make sure you mention the source of each one and explain why you are using it.



##  Step 3: Data Dictionary 
In this section you will generate a data dictionary for your dataset. Data dictionaries are extremely important because they give a quick overview of the various types of data you have.

Typically, each column of your dataset has a data type associated with it. The data types that you will typically use are:

- Int
- Float
- String
- Categorical
- Datetime 

Try to cast each variable into one of these data types. The data dictionary is a table with:
- One column for the feature/target name
- One column for the data type
- One column with a short description of the feature/target

You can generate markdown tables at:
https://www.tablesgenerator.com/markdown_tables
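
If your data is already loaded into a pandas DataFrame, the table can also be produced from code. The sketch below is one possible approach, not a required one. The `build_data_dictionary` helper, the example columns, and their descriptions are all hypothetical and only serve as illustration.

```python
import pandas as pd


def build_data_dictionary(df, descriptions):
    """Return a markdown table with one row per column: name, dtype, description."""
    lines = [
        "| Name | Data type | Description |",
        "| --- | --- | --- |",
    ]
    for column in df.columns:
        # Columns without a description get an empty cell so gaps are easy to spot.
        description = descriptions.get(column, "")
        lines.append(f"| {column} | {df[column].dtype} | {description} |")
    return "\n".join(lines)


# Hypothetical example data, used only to show the output format.
df = pd.DataFrame({
    "age": [34, 51, 27],
    "income": [42000.0, 58500.5, 39000.0],
    "signup_date": pd.to_datetime(["2021-01-05", "2020-11-30", "2022-03-14"]),
    "plan": pd.Categorical(["basic", "premium", "basic"]),
})
descriptions = {
    "age": "Age of the customer in years",
    "income": "Yearly income",
    "signup_date": "Date the customer signed up",
    "plan": "Subscription plan (target)",
}

data_dictionary = build_data_dictionary(df, descriptions)
print(data_dictionary)
```

You can paste the printed table into a markdown cell and then refine the descriptions by hand.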


## Step 4: Feature Extraction

- If you have unstructured data, then in this step you need to extract features from it to generate a dataset.

## Step 5: Data cleaning

Some points to keep in mind: 

- [ ] Find missing values. 
- [ ] Find NaN and 0 values. 
- [ ] Does each column have a single, consistent dtype?
- [ ] Convert dates to datetime types.
    - [ ] You can use the Python package `arrow` or the standard `datetime` module.
- [ ] Convert categorical variables to type 'category' if working with pandas. 
- [ ] Convert strings to ints or floats if they represent numbers.
- [ ] Standardize strings
    - [ ] Convert them to lower case if possible.
    - [ ] Replace spaces with underscores or dashes.
    - [ ] Remove white spaces around the string **this is very critical**.
    - [ ] Check for inconsistent spellings *typically done manually*.
- [ ] Look for duplicate rows or columns.
- [ ] Look for preprocessed columns, for example a categorical column that also appears as a second column of encoded categorical labels.
    
A list of data cleaning libraries: https://mode.com/blog/python-data-cleaning-libraries/ 
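
Several of the checks above can be sketched with pandas as follows. The `raw` DataFrame, its column names, and the `clean` helper are hypothetical. Adapt the steps to your own columns rather than treating this as a fixed recipe.

```python
import pandas as pd

# Hypothetical raw data with typical problems: stray whitespace, mixed case,
# numbers stored as strings, dates stored as strings, a duplicate row, a missing value.
raw = pd.DataFrame({
    "City": [" New York", "new york ", "Boston", "Boston", None],
    "Price": ["10.5", "20", "0", "0", "7.25"],
    "Sold On": ["2021-01-05", "2021-02-10", "2021-03-15", "2021-03-15", "2021-04-01"],
})


def clean(df):
    df = df.copy()
    # Standardize column names: strip, lower-case, replace spaces with underscores.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    # Standardize strings the same way (stripping whitespace is the critical part).
    df["city"] = df["city"].str.strip().str.lower().str.replace(" ", "_", regex=False)
    # Convert strings that represent numbers; unparseable values become NaN.
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    # Convert dates to a datetime type.
    df["sold_on"] = pd.to_datetime(df["sold_on"])
    # Convert categorical variables to the pandas 'category' type.
    df["city"] = df["city"].astype("category")
    # Drop exact duplicate rows.
    return df.drop_duplicates().reset_index(drop=True)


cleaned = clean(raw)

# Count missing values and zeros so you can decide how to handle them.
missing_per_column = cleaned.isna().sum()
zero_prices = int((cleaned["price"] == 0).sum())

print(cleaned.dtypes)
print(missing_per_column)
print("zero prices:", zero_prices)
```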


## Step 6: Data preparation

- [ ] Convert categorical features to dummy (one-hot) variables. If you are doing classification, assign numerical labels to the target classes.
- [ ] Do a train-test split to generate a test set, then do a train-validation split on the remaining data. You will need to run the `train_test_split` function from sklearn twice for this purpose.
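
The two steps above can be sketched as follows for a regression problem. The DataFrame and its column names are hypothetical, and the 60/20/20 split is only an assumed example ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: predict price from area and city.
df = pd.DataFrame({
    "area": [50, 65, 80, 120, 95, 70, 110, 60, 85, 100],
    "city": ["a", "b", "a", "c", "b", "a", "c", "b", "a", "c"],
    "price": [100, 140, 155, 260, 200, 138, 240, 130, 165, 220],
})

# Convert the categorical feature to dummy variables.
# drop_first=True drops one level to avoid perfectly collinear columns.
X = pd.get_dummies(df.drop(columns="price"), columns=["city"], drop_first=True)
y = df["price"]

# First call: hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second call: take 25% of the remainder as the validation set
# (0.25 * 0.8 = 0.2 of the full data), leaving 60% for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing `random_state` keeps the split reproducible between runs.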


## Step 7: Exploratory Data Analysis and Data Visualization

There are multiple steps that you need to take here: 

- [ ] Identify outliers in the dataset. Keep track of them; we want to train the model both with and without the outliers to see their effect.
- [ ] Check for imbalance in the target variable. Quantify the imbalance. 
- [ ] Pairplot if possible to check the relationship between all the features and the target.
- [ ] Look at the histogram for each variable, try to identify if you have a symmetric or normal distribution.
- [ ] If possible plot a QQ plot to check the normality of the data. If you want more information, refer to [this](https://refactored.ai/learn/normality-tests/24c311b1936a4037b29ef78d629f1320/).
- [ ] If it's a classification problem, run a chi-square test between each categorical feature and the target to check for correlation, and run ANOVA between the continuous/discrete features and the target to check for correlations.
- [ ] If it's a regression problem, get Pearson correlations between the continuous features and the target, and run ANOVA between each categorical variable and the target.
- Check for correlations between individual features; use similar approaches as you did with the target. 
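
Several of the checks in this list can be sketched with pandas and SciPy. The data below is randomly generated and every column name is hypothetical. The 1.5 × IQR rule used to flag outliers is a common convention chosen here for illustration, not something this guide prescribes.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data, generated only to make the example runnable.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "plan": rng.choice(["basic", "premium"], size=n),   # categorical feature
    "monthly_spend": rng.normal(50, 10, size=n),        # continuous feature
})
# Binary target for the classification case.
df["churned"] = (df["monthly_spend"] + rng.normal(0, 5, size=n) < 45).astype(int)
# Continuous target for the regression case.
df["lifetime_value"] = 3 * df["monthly_spend"] + rng.normal(0, 5, size=n)

# Quantify imbalance in the target as the share of each class.
class_share = df["churned"].value_counts(normalize=True)

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, q3 = df["monthly_spend"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (df["monthly_spend"] < q1 - 1.5 * iqr) | (df["monthly_spend"] > q3 + 1.5 * iqr)

# Classification: chi-square test between a categorical feature and the target.
contingency = pd.crosstab(df["plan"], df["churned"])
chi2, chi2_p, dof, expected = stats.chi2_contingency(contingency)

# Classification: one-way ANOVA of a continuous feature across the target classes.
groups = [group["monthly_spend"].to_numpy() for _, group in df.groupby("churned")]
f_stat, anova_p = stats.f_oneway(*groups)

# Regression: Pearson correlation between a continuous feature and the target.
pearson_r, pearson_p = stats.pearsonr(df["monthly_spend"], df["lifetime_value"])

print(class_share)
print("outliers flagged:", int(outlier_mask.sum()))
print("chi-square p-value:", chi2_p)
print("ANOVA p-value:", anova_p)
print("Pearson r:", pearson_r)
```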

## Step 8: Key Insights from EDA
In this section you will present what you have learnt from performing Exploratory Data Analysis. At the end of this section you should have written down the following:

- [ ] A bullet point list of the relationships between features and target and between individual features.
- [ ] A written summary of the conclusions of the exploratory data analysis.

The first point involves writing down the type of correlation observed between each feature and the target. For regression, you can state the correlation value (the Pearson r value) between the feature and the target. For classification, you can state the p-value of the chi-square test together with your conclusion on whether you reject or fail to reject the null hypothesis. The same should be done between individual features.

The second point involves writing a short summary of the first. You should state in words the conclusions that you reached.

There may be situations where you may be compelled to drop a variable because it either has too many outliers or has strong correlations with the output. You may even want to create new variables using external data that you have imported. You should document and discuss these changes in this section.


### Data visualizations

For each pair of variables (i.e. feature vs feature and feature vs target) you need a separate section with visualizations. Based on the types of the target and the feature, choose an appropriate plot (histogram, time-series plot, scatter plot, etc.).

Make sure that you label the plots properly. Each plot should have the following:
- [ ] Readable and descriptive axis labels.
- [ ] Font size of at least 12 for the x-y axis tick labels.
- [ ] A title.
- [ ] A legend that labels each curve, even if you have a single curve.
- [ ] If you are using a scatter plot, no fancy points, just use circles, triangles and other simple geometric shapes.
- [ ] Make sure that everything on the plot is easy to read. Use colors meaningfully; most plots do not need several colors.
- [ ] Try to make sure that you start at the origin (x=0, y=0). Be aware of the scale of your data. 


Make sure you save your final visualizations in the ```/images/``` folder and load them from there. Make sure you give the images meaningful names. For example, you can name an EDA image between two features "eda_scatter_feature1_feature2.png". Everyone has their own naming convention; make sure that you stick to yours.
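
A minimal matplotlib sketch that follows the checklist above and saves the result is shown here. The data is randomly generated, the axis labels are placeholders, and the relative `images` directory is an assumption about where your images folder sits relative to the notebook.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for two hypothetical features.
rng = np.random.default_rng(1)
feature1 = rng.uniform(0, 10, size=50)
feature2 = 5 + 2 * feature1 + rng.normal(0, 1, size=50)

fig, ax = plt.subplots(figsize=(6, 4))

# Simple circular markers and a single meaningful color.
ax.scatter(feature1, feature2, marker="o", color="tab:blue", label="Observations")

# Readable, descriptive axis labels and a title.
ax.set_xlabel("Feature 1 (units)", fontsize=12)
ax.set_ylabel("Feature 2 (units)", fontsize=12)
ax.set_title("Feature 2 vs Feature 1")

# Tick labels with a font size of at least 12.
ax.tick_params(axis="both", labelsize=12)

# Start both axes at the origin; this works here because all values are positive.
ax.set_xlim(left=0)
ax.set_ylim(bottom=0)

# A legend, even though there is a single series.
ax.legend(fontsize=12)

# Save under a descriptive name in the images folder (assumed relative path).
images_dir = Path("images")
images_dir.mkdir(exist_ok=True)
output_path = images_dir / "eda_scatter_feature1_feature2.png"
fig.savefig(output_path, dpi=150, bbox_inches="tight")
plt.close(fig)
```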
 
### Importing Images into a Jupyter notebook 
There may be times when you want to import images into Jupyter notebooks. These might be external plots or plots you have generated and saved.

Images can be imported into a Jupyter notebook from a different directory using a relative path.
An example of a relative path is:

```../../../images/image1.png```

This means that we go three folders up, then into the images folder, and reference ```image1.png```.
In order to import the image we would write

```<img src="../../../images/image1.png">``` 

where we are using the ```img``` HTML tag to display the image. This is written in a markdown cell of the Jupyter notebook, NOT a code cell.

It is best practice to put the above HTML tag in a separate markdown cell. This just makes it easier to access and troubleshoot.

**Note: If you are a DS1 student then head straight to the summary section**

## Step 9: Model building

This section is relevant to only DS2 and DS3 students. 

Here you will train a model on your training set. Make sure you train the model and then get predictions on the training set. We keep track of these because we want to check for overfitting: we will compare the results from the training set with those from the test set.

Typically, if you have high training accuracy (or a high R-squared value for regression) but low test accuracy (or a low R-squared value for regression), it means you are overfitting. In such a situation you need to move to a simpler model, use regularization, or get more data.
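
One way to see this in practice is to compare the training and test scores of a flexible model with those of a constrained one. The sketch below uses synthetic data and decision tree regressors purely as an illustration. The model choice and the `train_and_test_r2` helper are assumptions, not part of this guide's required workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def train_and_test_r2(model):
    """Fit on the training set and return (training R-squared, test R-squared)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# An unconstrained tree can memorize the training data, noise included.
deep_train, deep_test = train_and_test_r2(DecisionTreeRegressor(random_state=0))

# Limiting the depth gives a simpler model.
shallow_train, shallow_test = train_and_test_r2(
    DecisionTreeRegressor(max_depth=3, random_state=0)
)

print(f"deep tree:    train R2 = {deep_train:.2f}, test R2 = {deep_test:.2f}")
print(f"shallow tree: train R2 = {shallow_train:.2f}, test R2 = {shallow_test:.2f}")
```

A large gap between the training and test scores is the signal to look for. The simpler model should show a much smaller gap.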


For DS2 students: you will have been exposed only to linear regression, polynomial regression, logistic regression, and Naive Bayes. Hence you choose from these algorithms based on the type of problem (classification or regression) that you have chosen.

For DS3 students: you have learnt a whole range of algorithms and techniques, so this is the place to try them out. Start with a simple baseline model and then try more complex models. Example for classification:
- Start with a baseline model such as logistic regression or Naive Bayes
- Then try Decision tree
- Then try SVC 
- Then Random forests
- Then XGBoost

The same can be said for regression. Running multiple models and looking at the accuracy and confusion matrix will help you understand how to judge each model on the training set. 
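
The progression above can be sketched as a loop over scikit-learn classifiers. The dataset is synthetic and the hyperparameters are mostly library defaults, so treat the numbers as illustrative only. XGBoost is left out because it lives in a separate package, but it could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baselines first, then progressively more complex models.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svc": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    results[name] = {
        "train_accuracy": accuracy_score(y_train, train_pred),
        "test_accuracy": accuracy_score(y_test, model.predict(X_test)),
        "train_confusion": confusion_matrix(y_train, train_pred),
    }

for name, scores in results.items():
    print(f"{name}: train = {scores['train_accuracy']:.2f}, test = {scores['test_accuracy']:.2f}")
```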

If possible, use grid search to find hyperparameters. Grid search is especially useful when you have many of them.
https://scikit-learn.org/stable/modules/grid_search.html
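
A minimal sketch with scikit-learn's `GridSearchCV` is shown below. The synthetic data, the choice of `SVC`, and the parameter grid are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Every combination of these values is tried: 3 x 2 = 6 candidates.
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf"],
}

# 5-fold cross-validation on the training set only; the test set stays untouched.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
best_cv_score = search.best_score_

# By default the search refits the best model on the full training set,
# so it can be scored directly on the held-out test set.
test_score = search.score(X_test, y_test)

print("best parameters:", best_params)
print(f"cross-validated accuracy: {best_cv_score:.2f}")
print(f"test accuracy: {test_score:.2f}")
```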



## Step 10: Model evaluation 
Once you have trained your model, you must evaluate its performance on the test set and compare it with the results from the training set.

## Step 11: Key Insights from Predictive analysis 

## Step 12: Exporting the model

## Step 13: Project Summary