# Tests Checklist

NOTE: (Tested) means the checkpoint was carried out

0. Some data visualizations (Tested) 

1. Principal Component Analysis

2. Locally Linear Embedding

3. Multidimensional Scaling

4. Isomap

5. t-Distributed Stochastic Neighbor Embedding

6. Linear Discriminant Analysis

7. Correlation Metrics (Tested)
    
    a. Pearson (Tested)
    
    b. Kendall (Tested)
    
    c. Spearman (Tested)
    
    
8. Chi Squared Test (Tested)

9. Recursive Feature Elimination

10. Lasso (Regression) : SelectFromModel

11. Tree-based : SelectFromModel

12. Drawing a HeatMap of the correlation matrix (Tested)
    
    a. Pearson (Tested)
    
    b. Kendall (Tested)
    
    c. Spearman (Tested)
    
    
13. Feature selection methods supported by scikit-learn http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection


# Further Details:
   

## (7)

### Relevant Files

all_correlation_tests.ipynb

100_correlation_tests.ipynb

### Verdict:

Correlation matrices were generated using the Kendall, Spearman, and Pearson methods. Two assumptions commonly listed for Pearson correlation are that both columns being compared are normally distributed, and that the "noise" (variance) of the dependent variable is the same across all values of the independent variable (homoscedasticity).

The first assumption was checked by plotting a histogram of all the data points in a specific column (shown below). The data seems to be capped at the extremes (0 and 1), and the points in the middle look roughly normally distributed with some skewness. I tried to find out whether this violates the Pearson correlation assumptions but could not find any information on it.

![all histogram](../images/allhist.png)
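The visual check above could be complemented with a few numbers. The following is only a minimal sketch on synthetic data: the helper name `describe_column`, the bounds 0 and 1, and the choice of `scipy.stats.normaltest` are my assumptions, not something taken from the notebooks.

```python
import numpy as np
from scipy import stats


def describe_column(values, low=0.0, high=1.0):
    """Summarize how far a bounded column is from a normal shape.

    `low` and `high` are assumed bounds (0 and 1, as seen in the histogram).
    """
    values = np.asarray(values, dtype=float)
    # Share of points sitting exactly on (or beyond) the caps.
    at_bounds = np.mean((values <= low) | (values >= high))
    # Skewness of the points strictly between the caps.
    interior = values[(values > low) & (values < high)]
    return {
        "fraction_at_bounds": float(at_bounds),
        "interior_skewness": float(stats.skew(interior)),
        # D'Agostino-Pearson test; a small p-value is evidence against normality.
        "normaltest_pvalue": float(stats.normaltest(values).pvalue),
    }


# Synthetic stand-in for one column: a normal sample clipped to [0, 1].
rng = np.random.default_rng(0)
column = np.clip(rng.normal(0.5, 0.3, size=2000), 0.0, 1.0)
summary = describe_column(column)
print(summary)
```

On the real data, `column` would be replaced by one column of the dataset.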

The second assumption was checked with Levene's test, whose null hypothesis is that the input samples come from populations with equal variances. The lower the p-value returned by the test, the stronger the evidence against the null hypothesis. The image below is a heatmap of the matrix of p-values from a Levene test between each pair of columns in the dataset. Please note that the diagonal values are all set to zero to place stronger emphasis on the Levene tests between columns that are not equal to each other.

![levene heatmap](../images/leveneheatmap.png)

The majority of the p-values are very close to 0, suggesting that the equal-variance assumption behind the Pearson correlation matrix fails.
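For reference, a pairwise Levene p-value matrix like the one in the heatmap could be built roughly as follows. This is a sketch under assumptions: the function name `levene_pvalue_matrix` and the synthetic columns are hypothetical, and the notebooks may compute the matrix differently. It uses `scipy.stats.levene` with its default settings.

```python
import numpy as np
import pandas as pd
from scipy import stats


def levene_pvalue_matrix(df):
    """Return a symmetric DataFrame of Levene p-values for every pair of columns.

    The diagonal is left at zero, mirroring the heatmap described above.
    """
    cols = df.columns
    n = len(cols)
    pvals = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            _, p = stats.levene(df.iloc[:, i], df.iloc[:, j])
            pvals[i, j] = pvals[j, i] = p
    return pd.DataFrame(pvals, index=cols, columns=cols)


# Synthetic example: "a" and "b" share a variance, "c" has a much larger one.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(0, 1, 500),
    "b": rng.normal(0, 1, 500),
    "c": rng.normal(0, 5, 500),
})
pvalues = levene_pvalue_matrix(demo)
print(pvalues.round(4))
```

A small off-diagonal entry means the two columns are unlikely to have equal variances.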

The Kendall and Spearman correlations are rank-based and do not assume normality or equal variances, so no further assumption tests were required.
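As an illustration of how the three matrices can be produced, here is a small sketch using `pandas.DataFrame.corr`, which accepts `"pearson"`, `"kendall"`, and `"spearman"` as `method`. The two-column frame is synthetic and only meant to show how the metrics differ on a monotonic but non-linear relationship. It is not the project data.

```python
import numpy as np
import pandas as pd

# Synthetic data: y grows monotonically with x, but not linearly.
x = np.linspace(0.0, 1.0, 200)
demo = pd.DataFrame({"x": x, "y": np.exp(5.0 * x)})

# One correlation matrix per method.
corr = {m: demo.corr(method=m) for m in ("pearson", "kendall", "spearman")}

for method, matrix in corr.items():
    print(method, round(matrix.loc["x", "y"], 3))
```

The rank-based metrics treat the relationship as perfectly monotonic, while Pearson reports a weaker value because the relationship is not linear.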

## (12)

### Relevant Files

all_correlation_tests.ipynb

100_correlation_tests.ipynb

### Verdict:

A heatmap of each correlation matrix was plotted to better visualize the relationships between the columns. The images below show the heatmaps generated from the Pearson, Kendall, and Spearman correlation matrices, respectively:

![pearson heatmap](../images/pearsoncorr.png)

![kendall heatmap](../images/kendallcorr.png)

![spearman heatmap](../images/spearmancorr.png)

It looks like the correlation value is high at every 8th column, regardless of which metric is used (although the Kendall correlation matrix heatmap seems to follow this pattern more closely). I tried grouping these columns together and decided to call each group an "octate group", where each octate holds 15 columns. The following image displays a heatmap of the correlation matrix in each octate group:

![octate heatmap](../images/corrmatrixbyoctate.png)
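The grouping could be sketched as below. I am assuming here that an octate group is formed by column position modulo 8 (columns 0, 8, 16, ... together), which gives 8 groups of 15 columns for 120 columns. The helper names and the synthetic data are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def split_into_octates(df, n_groups=8):
    """Group columns by position modulo n_groups (assumed grouping rule)."""
    return [df.iloc[:, g::n_groups] for g in range(n_groups)]


def mean_abs_offdiag(corr):
    """Average absolute correlation, ignoring the diagonal of ones."""
    values = corr.to_numpy()
    mask = ~np.eye(len(values), dtype=bool)
    return float(np.abs(values[mask]).mean())


# Synthetic 120-column frame: column k is a noisy copy of latent signal k % 8.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 8))
data = np.hstack([latent + 0.3 * rng.normal(size=(300, 8)) for _ in range(15)])
demo = pd.DataFrame(data, columns=[f"col{i}" for i in range(120)])

octates = split_into_octates(demo)
within = [mean_abs_offdiag(group.corr()) for group in octates]
overall = mean_abs_offdiag(demo.corr())
print("columns per group:", [group.shape[1] for group in octates])
print("mean |corr| within groups:", np.round(within, 3))
print("mean |corr| over all columns:", round(overall, 3))
```

`DataFrame.corr` defaults to Pearson. Passing `method="kendall"` or `method="spearman"` would give the other two matrices per group.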


Unlike the first huge matrix, there seems to be a lot more correlation when we group the columns into their respective octate groups. Furthermore, we can look at scatter plots of the relationships between the columns. The following image shows these relationships for the first octate group.

![octate0 scatter](../images/octate0scatter.png)

Unlike the following image (a scatter plot of the relationships between the first 10 columns), the image above shows a lot more "obvious" relationships (some of them non-linear), because we grouped the columns into octates.

![first10cols scatter](../images/first10colcatter.png)

Drawing the heatmap of the correlation matrix suggests that we can go from 120 columns to 8 columns, each capturing the relationship among the 15 columns in its octate group.


## (8)

### Relevant Files

100_chisquared.ipynb

### Verdict:

30 minutes into working with chi-squared tests, I realized that the chi-squared test of independence can only be used to compare categorical variables.
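To make the restriction concrete, here is a small sketch of the test on two categorical columns, using `pandas.crosstab` and `scipy.stats.chi2_contingency`. The column names and values are made up for illustration and do not come from the dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Two hypothetical categorical columns with a clear dependence.
demo = pd.DataFrame({
    "group": ["x"] * 50 + ["y"] * 50,
    "label": ["p"] * 40 + ["q"] * 10 + ["p"] * 10 + ["q"] * 40,
})

# The test works on a contingency table of category counts.
table = pd.crosstab(demo["group"], demo["label"])
chi2_stat, p_value, dof, expected = chi2_contingency(table)

print(table)
print("chi2:", round(chi2_stat, 2), "dof:", dof, "p-value:", p_value)
```

The test needs counts per category, so a continuous column would first have to be binned (for example with `pandas.cut`), which is a modelling choice I did not make in these notes.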

**1 hour compiling summary**

**Total time : 11.5 hours**

1. (2 hours) researching feature selection
2. (4 hours) playing around with the top-100 csv file
3. (0.5 hours) researching correlation metrics
4. (1 hour) figuring out Pearson correlation assumption tests
5. (2.5 hours) repeating all the tests done on the top-100 csv file on the entire dataset
6. (0.5 hours) researching chi2 feature selection
7. (1 hour) compiling summary so far
