# Project Walkthrough

## Creating Cohort of Songs
Follow these steps in order:
1. Clean the data: handle outliers and missing values, and remove duplicates (`drop_duplicates()`)
2. Perform data preprocessing:
    - Further feature engineering
    - Encoding
    - Postpone scaling so you can perform EDA first
    - Convert the release date to datetime and extract date parts. Hint: use `to_datetime()`, then extract, for example, the year using `df['release_date'].dt.year`
3. EDA:
    - Take the average popularity score for each album and use `nlargest()` to pick the top albums
    - Conduct exploratory data analysis to delve into various features of songs, aiming to identify patterns
        - Plot year on the x-axis and a numeric feature (e.g. energy) on the y-axis as a line graph, and observe the pattern/trend for each feature
        - check for association between numeric features:
            - correlation matrix (linear analysis)
            - scatter/regplot for numeric columns e.g. popularity vs acousticness (linear or non-linear)
        - Examine the relationship between a song's popularity and various factors, exploring how this correlation has evolved:
            - create a new column for decade
            - build a popularity vs acousticness plot for each decade (do the same for the rest of the numeric values)
        - Provide insights on the significance of dimensionality reduction techniques. Share your ideas and elucidate your observations:
            - Analyze the explained variance of the PCA components and suggest the number of PCs needed to reach 90 to 95% cumulative explained variance
4. Cluster analysis:
    - Use K-means (`KMeans`) and the elbow method to identify a suitable number of clusters
    - After building the clusters, compute summary statistics (mean, median, etc.) for every numeric feature by cluster
    - Visualize the clusters
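
The date handling in step 2 and the album ranking and decade column in step 3 can be sketched as follows. This is only a minimal illustration on made-up data: the column names `album` and `popularity` are assumptions, and only `release_date` comes from the hint above.

```python
import pandas as pd

# Tiny made-up sample; the column names `album` and `popularity` are assumptions.
df = pd.DataFrame({
    "album": ["A", "A", "B", "B", "C"],
    "release_date": ["1969-09-26", "1969-09-26", "1973-03-01",
                     "1973-03-01", "1991-09-24"],
    "popularity": [70, 80, 60, 50, 90],
})

# Step 2: parse the release date and extract date parts.
df["release_date"] = pd.to_datetime(df["release_date"])
df["year"] = df["release_date"].dt.year
df["month"] = df["release_date"].dt.month

# Step 3: derive a decade column (e.g. 1969 -> 1960) for the per-decade plots.
df["decade"] = (df["year"] // 10) * 10

# Step 3: average popularity per album, then keep the 2 highest averages.
top_albums = df.groupby("album")["popularity"].mean().nlargest(2)
print(top_albums)
```

With the `decade` column in place, the per-decade plots from step 3 can be built by grouping or faceting on it.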

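The PCA variance analysis (step 3) and the elbow method (step 4) might look roughly like the sketch below. It assumes scikit-learn is available and uses synthetic data from `make_blobs` as a stand-in for the scaled numeric song features; the 90% threshold is the lower end of the range mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric song features (assumption, not real data).
X, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# PCA: cumulative explained variance, and the smallest number of PCs
# whose cumulative explained variance reaches at least 90%.
pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.90)) + 1
print(f"PCs needed for >= 90% variance: {n_components}")

# Elbow method: inertia (within-cluster sum of squares) for k = 1..8.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias.append(km.inertia_)

for k, inertia in enumerate(inertias, start=1):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `k` gives the elbow curve; the point where the curve flattens is the usual candidate for the number of clusters.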
## Employee Turnover Analytics

1. Run the same standard DQA (data cleansing) as in the previous project
2. EDA
    - Build a heatmap for numeric columns
    - Check if the target is imbalanced using bar plots
    - Build histograms for all 3 columns
    - Use `countplot()` with `left` as the hue
3. Build the clusters 
    - Cluster on 2 features (satisfaction and last evaluation), keeping only employees who left (`left == 1`)
    - Use K-means (`KMeans`) with 3 clusters
    - An example of a cluster you may observe: some employees have a high evaluation score (high performers) but left the company and weren't satisfied
4. Working with imbalanced data
    - Preprocessing using encoding and standardization
    - No need to separate the column types first: you can call `get_dummies()` on the whole DataFrame and just specify the categorical columns
    - Use `SMOTE()` for imbalanced data
    - In `train_test_split()`, use `stratify=y`. **Note how we changed the order compared to the problem statement, as it makes more sense to split after `SMOTE()`**
5. Model Building:
    - Build all 3 models with 5-fold cross-validation
    - For each model, run `sklearn.metrics.classification_report(y_true, y_pred)`
6. ROC curves and AUC:
    - Build the ROC curve and compute the AUC for each model
    - Plot all 3 curves on one figure and compare
7. Risk zones:
    - Pick the best model. For example, for logistic regression, use `LR_model.predict_proba(X_test)` to calculate the probabilities for `X_test`
    - Using the probabilities, break down the outcome into 4 groups:
        - Safe Zone (Green) (Score < 20%)
        - Low Risk Zone (Yellow) (20% < Score < 60%)
        - Medium Risk Zone (Orange) (60% < Score < 90%)
        - High Risk Zone (Red) (Score > 90%)
    - Store the probabilities in a new column and create a second column with the zone, based on the conditions above
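
The zone assignment in step 7 can be sketched with `pd.cut`. This is a minimal illustration on made-up probabilities. The column names `prob_left` and `risk_zone` are hypothetical, and because the thresholds above do not say which zone a score of exactly 20%, 60% or 90% belongs to, the sketch assumes boundary values fall into the lower zone.

```python
import numpy as np
import pandas as pd

# Made-up probabilities of leaving. In the walkthrough these would come from
# something like LR_model.predict_proba(X_test)[:, 1] (column 1 = class 1).
proba_left = np.array([0.05, 0.20, 0.45, 0.60, 0.75, 0.90, 0.97])

# Hypothetical column names, chosen only for this example.
results = pd.DataFrame({"prob_left": proba_left})

# Assumption: bins are closed on the right, so a score exactly on a
# boundary (0.20, 0.60, 0.90) is placed in the lower zone.
bins = [0.0, 0.20, 0.60, 0.90, 1.0]
labels = [
    "Safe Zone (Green)",
    "Low Risk Zone (Yellow)",
    "Medium Risk Zone (Orange)",
    "High Risk Zone (Red)",
]
results["risk_zone"] = pd.cut(
    results["prob_left"], bins=bins, labels=labels, include_lowest=True
)
print(results)
```

Passing `right=False` to `pd.cut` would instead place boundary scores in the higher zone, if that reading of the thresholds is preferred.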

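For steps 5 and 6, one possible way to combine 5-fold cross-validation with the classification report and ROC/AUC is shown below. It is a sketch under assumptions: the data is synthetic (`make_classification`), and logistic regression stands in for whichever 3 models the problem statement asks for; the same pattern would be repeated per model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, imbalanced stand-in for the employee data (assumption).
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=42
)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predictions: every sample is predicted by a model
# that did not see it during training.
y_pred = cross_val_predict(model, X, y, cv=cv)
y_proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

print(classification_report(y, y_pred))

# ROC curve points and the area under the curve.
fpr, tpr, _ = roc_curve(y, y_proba)
auc = roc_auc_score(y, y_proba)
print(f"AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` for each model on the same axes gives the comparison asked for in step 6.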

    