# Types of Learning

Resources:
https://blogs.nvidia.com/blog/2018/08/02/supervised-unsupervised-learning/

Check Comparison Table here: https://www.aitude.com/supervised-vs-unsupervised-vs-reinforcement

1. Supervised Learning:
    * Labelled data is available: for each set of inputs, the expected output is already known.
    * The desired output is defined up front; the task is to devise a way to predict it accurately and efficiently.
    * Typical problem areas:
        1. Classification - Predict the class of the given observation
        2. Regression - Predict the continuous value for the given observation
        
    * Issues:
         Labelled datasets are often hard or expensive to obtain.
    * Common Algorithms:
         Naive Bayes, SVM, Linear/Logistic Regression, KNN


2. Unsupervised Learning:
    * Deals with unlabelled data.
    * It is more of exploratory analysis where we try to find the hidden patterns within the data.
    * Think of it as not knowing in advance what you want to get out of the data; you only suspect that some hidden relationships and associations exist in it.
    * There's no correct answer, hence it is difficult to determine the accuracy of models.
    * Typical problem areas:
        1. Clustering: Group similar-looking observations together.
        2. Association: Find the most common associations or hidden associations (Market-Basket Analysis)
        3. Autoencoders: Learn a compressed representation of the data and reconstruct the input from it. Often discussed alongside GANs.
        4. Anomaly Detection: Is the data behaving as expected?
    
    * Common Algorithms:
        K-Means Clustering, PCA, GANs, Apriori(Association)


3. Reinforcement Learning:
    * Given a start state and an end state, find an optimal way to accomplish a particular goal, or improve performance on a specific task.
    * The agent interacts with its environment and tries to predict the next best step towards the final big reward.
    * Autonomous vehicles are a frequently cited application of this.
    * Common Algorithms:
        Deep Q Network(DQN), Q Learning, SARSA
    * It is not unsupervised: We know the goal the model is expected to reach, and a reward signal tells it how well it is doing.
    * It is not supervised: It doesn't rely on labelled data.
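
The difference between the first two types can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the toy "hours studied" data, the function names `predict_1nn` and `kmeans_1d`, and the starting centers are all invented here. The supervised function needs the labels; the unsupervised one only sees the raw numbers. Reinforcement learning is not covered by this sketch.

```python
# Hypothetical toy data: hours studied, with a known outcome for each row.
labelled = [(1.0, "fail"), (2.0, "fail"), (3.0, "fail"),
            (7.0, "pass"), (8.0, "pass"), (9.0, "pass")]


def predict_1nn(x, training):
    """Supervised: return the label of the closest labelled example."""
    nearest = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def kmeans_1d(points, centers, iterations=10):
    """Unsupervised: group unlabelled points around the given starting centers."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center.
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters


# Supervised: labels drive the prediction.
print(predict_1nn(2.5, labelled))

# Unsupervised: same inputs with the labels thrown away.
unlabelled = [x for x, _ in labelled]
centers, clusters = kmeans_1d(unlabelled, centers=[0.0, 10.0])
print(centers, clusters)
```

Note that the clustering recovers two groups but has no idea that they mean "fail" and "pass"; that interpretation is left to the analyst.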


# Data Analysis Workflow

Industry standard is to follow CRISP-DM (Cross-Industry Standard Process for Data Mining).

https://www.sv-europe.com/crisp-dm-methodology/

![CRISP-DM.jpeg](attachment:CRISP-DM.jpeg)

# Data Exploration / Data Understanding
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

https://www.analyticsvidhya.com/blog/2013/11/simple-manipulations-extract-data/

#### Step 1: Variable Identification & Analysis:
* Questions you try to answer:
    1. What does each variable represent?
    2. What is its purpose?
    3. What is the datatype of each variable?
    4. What is the spread, i.e. the min/max range?
    5. Is the data skewed towards particular values?
    6. Are there any missing values or outliers?
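
Most of these questions can be given a first answer with a handful of pandas calls. The sketch below assumes a small, made-up DataFrame (the column names and values are hypothetical); on a real dataset you would load your own data instead.

```python
import pandas as pd

# Hypothetical toy dataset, invented for illustration only.
df = pd.DataFrame({
    "age": [23, 35, None, 41, 29],
    "city": ["Pune", "Delhi", "Pune", None, "Mumbai"],
    "income": [32000.0, 54000.0, 47000.0, 61000.0, 380000.0],
})

print(df.dtypes)            # datatype of each variable
print(df.describe())        # count, mean, std, min, quartiles, max for numeric columns
print(df.isna().sum())      # number of missing values per column
print(df["income"].skew())  # sample skewness; a large positive value hints at a long right tail
```

What each variable represents and what its purpose is cannot be read off the data; that still has to come from documentation or domain knowledge.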

#### Types of Data:
* Based on Structure:
    * Structured: Well-formatted. Usually in form of tables.
    * Unstructured: Text, images, videos.

* Based on Data's Type:
    * Text
    * Continuous: Numerical
    * Categorical: Discrete. Can be Numerical or Text

   Categorical data is further divided into:
    * Nominal - Categorical: No specific order
    * Ordinal - Categorical: Specific Order
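
One way to make the nominal/ordinal distinction explicit in code is `pandas.Categorical`, which accepts an `ordered` flag. The example values below are made up for illustration.

```python
import pandas as pd

# Nominal: categories have no inherent order.
colour = pd.Categorical(["red", "green", "red", "blue"])

# Ordinal: the order is declared explicitly, lowest to highest.
size = pd.Categorical(["M", "S", "L", "M"],
                      categories=["S", "M", "L"], ordered=True)

print(colour.ordered)          # False
print(size.ordered)            # True
print(size.min(), size.max())  # min/max are only meaningful for ordered categories
print(list(size.codes))        # integer position of each value in the declared order
```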

#### Step 2: Univariate Analysis:
* Analyse each variable independently.
* Very good at highlighting missing values and outliers
* For Continuous variables:
    
    Determine measures of central tendency: Mean, Median, Mode
    
    Determine measures of dispersion: Min, Max, Range, Quartiles, IQR, Variance, Standard deviation
    
    Determine measures of shape: Skewness, Kurtosis
* Histograms and boxplots can significantly help in this analysis.

![Quartile%20n%20IQR.gif](attachment:Quartile%20n%20IQR.gif)

* For Categorical variables:
    Count or % Frequency distribution of each category.
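
As a minimal sketch of univariate analysis, the standard library alone is enough for small data. The two lists below are hypothetical; `statistics.variance` and `statistics.stdev` use the sample (n - 1) definitions.

```python
import statistics
from collections import Counter

# Hypothetical continuous variable with one suspiciously large value.
values = [2, 4, 4, 5, 7, 9, 30]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
value_range = max(values) - min(values)
variance = statistics.variance(values)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(values)      # sample standard deviation

print(mean, median, mode, value_range, variance, std_dev)

# Hypothetical categorical variable: count and % frequency of each category.
colours = ["red", "blue", "red", "green", "red", "blue"]
counts = Counter(colours)
percentages = {k: 100 * v / len(colours) for k, v in counts.items()}

print(counts)
print(percentages)
```

The mean being well above the median is already a hint that the value 30 deserves a closer look.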

#### Step 3: Bivariate Analysis:
* Determines the relationship between 2 variables.
* Relationship combinations can be:
    1. Continuous - Continuous
    2. Continuous - Categorical
    3. Categorical - Categorical
* Typical questions that you can ask:
    1. How are the 2 variables related?
    2. What is the impact on one variable of increasing or decreasing the other?

#### Continuous - Continuous
* Analyse via Scatter Plot
    It can show you that a relationship exists, but it does not quantify the strength of that relationship.
    
    ![Scatter%20Plot.png](attachment:Scatter%20Plot.png)
 
 
* Correlation:
    Corr(X,Y) = Covariance(X,Y) / SQRT(VAR(X) * VAR(Y))
    * Its value lies between -1 and 1
    
    -1 => Perfect negative correlation
    
    +1 => Perfect positive correlation
    
    0 => No linear correlation

    As a rough rule of thumb:
    
            abs(Corr) >= 0.7 Strongly correlated
        
            0.45 <= abs(Corr) < 0.7 Needs further analysis; it can be down to chance as well.
        
            abs(Corr) < 0.45 Weakly correlated or not correlated
    
    Problem: Correlation is symmetric, i.e. Corr(X,Y) = Corr(Y,X), but real-world relationships are often highly asymmetric.
    
    e.g. Given a pincode, you can easily find the city, but given a city, you can't find a unique pincode.
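
The correlation formula above can be written out directly. This is a sketch in plain Python with made-up data (`hours` and `score` are hypothetical); the helper names are invented here, and sample covariance with an n - 1 denominator is assumed, which cancels out in the ratio anyway.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Corr(X, Y) = Cov(X, Y) / sqrt(Var(X) * Var(Y)); Var(X) is Cov(X, X)."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


hours = [1, 2, 3, 4, 5]       # hypothetical hours studied
score = [52, 55, 61, 70, 72]  # hypothetical exam scores

r = correlation(hours, score)
print(round(r, 3))

# Symmetry: swapping the arguments gives the same value.
print(correlation(score, hours) == r or abs(correlation(score, hours) - r) < 1e-12)
```

Because the value is identical in both directions, correlation alone cannot tell you which variable determines the other.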

#### Categorical - Categorical
* To determine which 2 values often come together.
    1. Matrix/Table of values from 2 categories.
       * We populate the matrix by count or count%
       * You can use pandas.crosstab() to build a frequency table.
    2. Stacked bar charts.
    3. Chi-Square Test: Derives the statistical significance of the relationship b/w 2 variables
     
     A good introduction to performing the chi-square test in Python: https://machinelearningmastery.com/chi-squared-test-for-machine-learning

In [1]:
import pandas as pd
data = [
    ('A', 1), ('A', 2), ('A', 1), ('A', 4),
    ('B', 3), ('B', 2), ('B', 1),
    ('C', 3), ('C', 2), ('C', 3),
]

df = pd.DataFrame(data, columns = ['First', 'Second'])
print(df)


  First  Second
0     A       1
1     A       2
2     A       1
3     A       4
4     B       3
5     B       2
6     B       1
7     C       3
8     C       2
9     C       3


In [2]:
pd.crosstab(df.First, df.Second)

Second  1  2  3  4
First
A       2  1  0  1
B       1  1  1  0
C       0  1  2  0
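
To make the chi-square idea concrete, the statistic can be computed by hand from the counts in the frequency table above. This sketch only derives the statistic and the degrees of freedom; turning them into a p-value needs a chi-square distribution (in practice a library routine such as `scipy.stats.chi2_contingency` would normally be used, which is an assumption about your environment and is not used here). With counts this small the test is not reliable; the numbers are for illustration only.

```python
# Observed counts, copied from the First x Second frequency table above.
observed = {
    "A": [2, 1, 0, 1],
    "B": [1, 1, 1, 0],
    "C": [0, 1, 2, 0],
}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(rows):
    for j, obs in enumerate(row):
        # Expected count if the two variables were independent.
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
dof = (len(rows) - 1) * (len(col_totals) - 1)

print(round(chi2, 3), dof)
```

A larger statistic (relative to the degrees of freedom) means the observed counts deviate more from what independence would predict.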


#### Continuous - Categorical
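* You determine the statistical significance of the relationship between the 2 variables by Z-Test/T-Test (2 categories) or ANOVA (more than 2 categories).

* Z-Test and T-Test are almost the same. The main difference is that the T-Test is used when the sample is small (as a rule of thumb, about 30 observations or fewer per category) or the population variance is unknown.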
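
As a rough sketch of what a T-Test measures, the snippet below computes a two-sample t statistic for a continuous variable split by a two-level category. It uses Welch's variant (no equal-variance assumption), which is a choice made here rather than something the notes prescribe; the function name `welch_t` and the data are hypothetical. Converting the statistic to a p-value requires the t distribution and is left out.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic: difference in means divided by its standard error."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1)
    var_b = statistics.variance(sample_b)
    std_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / std_error


# Hypothetical measurements of one continuous variable in two categories.
group_a = [12.1, 11.8, 12.6, 12.9, 12.3]
group_b = [10.9, 11.2, 10.7, 11.5, 11.0]

t = welch_t(group_a, group_b)
print(round(t, 3))
```

The further the statistic is from 0, the less plausible it is that both categories share the same mean.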

## Some Statistical Tips
https://www.kdnuggets.com/2019/06/statistics-data-scientists-know.html

### What do quartiles represent?
* Quartiles describe how the data is distributed. Each quarter holds 25% of the data, i.e. if you pick an observation at random, there is a 25% chance that it falls in the 1st quarter (min to 25th percentile).
* The median divides your data into 2 equal halves. So, the 50% point is known as the median.
* IQR or Inter Quartile Range covers the middle 50% of the data
* IQR = Q3(75%) - Q1(25%)
* Anything below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is commonly treated as an outlier.
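
The quartile and IQR rules above can be tried out with `statistics.quantiles` from the standard library. The data is hypothetical, and `method="inclusive"` is an assumption made here; different quantile conventions give slightly different cut points on small samples.

```python
import statistics

values = [2, 4, 4, 5, 7, 9, 30]  # hypothetical sample

# n=4 splits the data into quarters and returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, q2, q3)                # 4.0 5.0 8.0
print(iqr)                       # 4.0
print(lower_fence, upper_fence)  # -2.0 14.0
print(outliers)                  # [30]
```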

### Skewness
* Skewness tries to measure the asymmetry of data.
* If skewness is positive, the bulk of the data sits to the left of the center with a long tail to the right; if it is negative, the bulk sits to the right with a long tail to the left.
* It gives us an idea of how close our data is to a Gaussian distribution, since many ML algorithms work on the assumption that the data is Gaussian (often standardised to a mean of 0 and SD of 1).
* Skewness = [3 * (Mean - Median)] / Standard Deviation (this is Pearson's second skewness coefficient)
* So, from the above formula:
    1. If mean == median => Data is symmetric by this measure
    2. If mean > median or mean - median > 0 => Data is right skewed (Positive Skewness).
    3. If mean < median or mean - median < 0 => Data is left skewed (Negative Skewness).

![Skewness.png](attachment:Skewness.png)
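
The mean/median skewness formula from this section is easy to check on small examples. The function name and the three samples below are invented for illustration, and the sample standard deviation (n - 1) is assumed.

```python
import statistics


def pearson_median_skewness(values):
    """3 * (mean - median) / standard deviation, as in the formula above."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


right_tailed = [2, 4, 4, 5, 7, 9, 30]        # long tail to the right
left_tailed = [-v for v in right_tailed]     # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

print(pearson_median_skewness(right_tailed))  # positive
print(pearson_median_skewness(left_tailed))   # negative
print(pearson_median_skewness(symmetric))     # 0.0
```

Other skewness definitions exist (for example the moment-based one that pandas reports), so the exact numbers differ between tools even when the sign agrees.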