# Classifying Song Genres

In [None]:
# This is a code cell without any tag. You can put convenience code here,
# but it won't be included in any way in the final project.
# For example, to be able to run tests locally in the notebook
# you need to install the following:
# pip install nose
# pip install git+https://github.com/datacamp/ipython_nose
# and then load in the ipython_nose extension like this:
%load_ext ipython_nose

## 1. Preparing our dataset

_These recommendations are so on point! How does this playlist know me so well?_

<img src="img/iphone_music.jpg" alt="Project Image Record" width="600px"/>

Over the past few years, streaming services with huge catalogs have become the primary means through which most people listen to their favorite music. But at the same time, the sheer amount of music on offer can leave users a bit overwhelmed when looking for new music that suits their tastes.

For this reason, streaming services have looked into means of categorizing music to allow for personalized recommendations. One method involves direct analysis of the raw audio information in a given song, scoring the raw data on a variety of metrics. Today, we'll be examining data compiled by a research group known as The Echo Nest. Our goal is to look through this dataset and classify songs as being either 'Hip-Hop' or 'Rock' - all without listening to a single one ourselves. In doing so, we will learn how to clean our data, do some exploratory data visualization, and use feature reduction towards the goal of feeding our data through some simple machine learning algorithms, such as decision trees and logistic regression.

To begin with, let's load the metadata about our tracks alongside the track metrics compiled by The Echo Nest. A song is about more than its title, artist, and number of listens. We have another dataset that has musical features of each track such as `danceability` and `acousticness` on a scale from -1 to 1. These exist in two different files, which are in different formats - CSV and JSON. While CSV is a popular file format for denoting tabular data, JSON is another common file format in which databases often return the results of a given query.

Let's start by creating two pandas `DataFrames` out of these files that we can merge so we have features and labels (often also referred to as `X` and `y`) for the classification later on.

Read in the data using `pandas` and merge the DataFrames into one usable dataset.

- Using the pandas `read_csv()` function, read in the file with the track metadata (`datasets/fma-rock-vs-hiphop.csv`) and name the DataFrame `tracks`.
- Using the pandas `read_json()` function, read in the JSON file with the track acoustic metrics (`datasets/echonest-metrics.json`) and name the DataFrame `echonest_metrics`. Set the `precise_float` argument to `True` when reading in your data.
- Merge the DataFrames on matching `track_id` values. Only retain the `track_id` and `genre_top` columns of `tracks`. `echonest_metrics` should be the first (left) data frame in the merge.
- Inspect the DataFrame using the `.info()` method.

<hr>

## Good to know

This project lets you apply what you learned in [Supervised Learning with scikit-learn](https://www.datacamp.com/courses/supervised-learning-with-scikit-learn), plus data preprocessing, dimensionality reduction, and machine learning using the `scikit-learn` package. We recommend you are familiar with these topics before starting this project.

Helpful links:
- Documentation for pandas [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html), [`read_json()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html) and [`pd.merge()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) functions
- Variance of the PCA features [exercise](https://campus.datacamp.com/courses/unsupervised-learning-in-python/decorrelating-your-data-and-dimension-reduction?ex=7)
- Train/test/split + Fit/Predict/Accuracy [exercise](https://campus.datacamp.com/courses/supervised-learning-with-scikit-learn/classification?ex=11)

You can select columns of a DataFrame by passing a list of column names to the indexer (`[]`). A correct solution for the merge looks like this:

```python
echo_tracks = echonest_metrics.merge(tracks[['column1_name', 'column2_name']], on='column2_name')
```
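As a minimal sketch of what this merge does, the toy frames below (made-up values, not the project data) suggest why only some rows survive: `merge` defaults to an inner join, so a row is kept only if its `track_id` appears in both frames. The names `toy_metrics`, `toy_tracks`, and `toy_merged` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the real files; the values are made up for illustration.
toy_metrics = pd.DataFrame({
    'track_id': [1, 2, 3],
    'danceability': [0.42, 0.77, 0.31],
})
toy_tracks = pd.DataFrame({
    'track_id': [2, 3, 4],
    'genre_top': ['Rock', 'Hip-Hop', 'Rock'],
    'title': ['a', 'b', 'c'],
})

# Keep only the key and the label from the metadata, then merge on the key.
toy_merged = toy_metrics.merge(toy_tracks[['genre_top', 'track_id']], on='track_id')

print(toy_merged)
```

Track 1 has no metadata and track 4 has no metrics, so neither appears in `toy_merged`; the unused `title` column is dropped before the merge.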

In [None]:
import pandas as pd

# Read in track metadata with genre labels
tracks = ...

# Read in track metrics with the features
echonest_metrics = ...

# Merge the relevant columns of tracks and echonest_metrics
echo_tracks = ...

# Inspect the resultant dataframe
...

In [None]:
import pandas as pd

# Read in track metadata with genre labels
tracks = pd.read_csv('datasets/fma-rock-vs-hiphop.csv')

# Read in track metrics with the features
echonest_metrics = pd.read_json('datasets/echonest-metrics.json', precise_float=True)

# Merge the relevant columns of tracks and echonest_metrics
echo_tracks = echonest_metrics.merge(tracks[['genre_top', 'track_id']], on='track_id')

# Inspect the resultant dataframe
echo_tracks.info()

In [None]:
%%nose

def test_tracks_read():
    try:
        pd.testing.assert_frame_equal(tracks, pd.read_csv('datasets/fma-rock-vs-hiphop.csv'))
    except AssertionError:
        assert False, "The tracks data frame was not read in correctly."

def test_metrics_read():
    ech_met_test = pd.read_json('datasets/echonest-metrics.json', precise_float=True)
    try:
        pd.testing.assert_frame_equal(echonest_metrics, ech_met_test)
    except AssertionError:
        assert False, "The echonest_metrics data frame was not read in correctly."
        
def test_merged_shape(): 
    merged_test = echonest_metrics.merge(tracks[['genre_top', 'track_id']], on='track_id')
    try:
        pd.testing.assert_frame_equal(echo_tracks, merged_test)
    except AssertionError:
        assert False, ('The two datasets should be merged on matching track_id values '
                       'keeping only the track_id and genre_top columns of tracks.')

## 2. Pairwise relationships between continuous variables

We typically want to avoid using variables that have strong correlations with each other -- hence avoiding feature redundancy -- for a few reasons:
- To keep the model simple and improve interpretability (with many features, we run the risk of overfitting).
- When our datasets are very large, using fewer features can drastically speed up our computation time.

To get a sense of whether there are any strongly correlated features in our data, we will use built-in functions in the `pandas` package.

Explore correlations in our dataset using the pandas `corr()` method.
- Visually inspect the correlation table generated from `DataFrame.corr()` for any strong correlations.

<hr>

Helpful links:
- pandas DataFrame `corr` method [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html)

Does your code look something like this?

```python
corr_metrics = df.corr()
```
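If the table is hard to scan by eye, one option is to filter it programmatically. The sketch below uses synthetic data (the columns `a`, `b`, `c` and the 0.8 cutoff are arbitrary assumptions) to list the feature pairs whose absolute correlation exceeds a threshold.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'b' is built from 'a' plus a little noise, 'c' is independent.
rng = np.random.RandomState(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': 2 * a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

toy_corr = toy.corr()

# Mask the diagonal (every feature correlates perfectly with itself),
# then keep pairs whose absolute correlation exceeds an arbitrary cutoff.
off_diagonal = toy_corr.where(~np.eye(len(toy_corr), dtype=bool))
strong = off_diagonal.abs().stack()
strong = strong[strong > 0.8]

print(strong)
```

Each strongly correlated pair shows up twice, once in each order, because the correlation matrix is symmetric.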

In [None]:
# Create a correlation matrix
corr_metrics = ...
corr_metrics.style.background_gradient()

In [None]:
# Create a correlation matrix
corr_metrics = echonest_metrics.corr()
corr_metrics.style.background_gradient()

In [None]:
%%nose

def test_corr_matrix():
    assert isinstance(corr_metrics, pd.DataFrame) and corr_metrics.equals(echonest_metrics.corr()), \
        'The correlation matrix can be computed using the .corr() method.'

## 3. Normalizing the feature data

As mentioned earlier, it can be particularly useful to simplify our models and use as few features as necessary to achieve the best result. Since we didn't find any particularly strong correlations between our features, we can instead use a common approach to reduce the number of features called **principal component analysis (PCA)**.

It is possible that most of the variance in the dataset can be captured by just a few dimensions. PCA rotates the data so that the new axes (the principal components) point along the directions of highest variance, which lets us see how much of the total variance each component accounts for. Note that PCA is unsupervised: it looks only at the features and never at the genre labels.

However, since PCA uses the absolute variance of a feature to rotate the data, a feature with a broader range of values will overpower and bias the algorithm relative to the other features. To avoid this, we must first normalize our data. There are a few methods to do this, but a common way is through *standardization*, such that all features have a mean = 0 and standard deviation = 1 (the resultant is a z-score).
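To make the z-score concrete, here is a small sketch on made-up numbers (two hypothetical features on very different scales). It compares `StandardScaler` with the same calculation done by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales (e.g. a 0-1 score and a tempo in BPM).
toy = np.array([[0.1, 90.0],
                [0.5, 120.0],
                [0.9, 150.0],
                [0.3, 100.0]])

scaled = StandardScaler().fit_transform(toy)

# The same z-score by hand: subtract each column's mean, then divide by its
# population standard deviation, which is what StandardScaler uses.
by_hand = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns contribute on equal footing, so the large raw values of the second column no longer dominate the variance.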

Prepare our dataset for training a model, and standardize the data.
- Define our features from `echo_tracks` by removing `genre_top` and `track_id` from the DataFrame using `DataFrame.drop()` along axis 1.
- Define our labels -- in this case, the `genre_top` column from `echo_tracks`.
- Import the `StandardScaler` from the `sklearn.preprocessing` module
- Define an instance of the `StandardScaler` called `scaler` without passing any arguments and use the `fit_transform` method to scale `features` and save to a new variable called `scaled_train_features`


<hr>

Helpful links:
- pandas `DataFrame.drop()` method [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)
- Square brackets in Pandas [exercise](https://campus.datacamp.com/courses/intermediate-python-for-data-science/dictionaries-pandas?ex=15)

By convention, the `axis` argument of the `drop()` method uses `0` for rows and `1` for columns. Passing the labels through `columns=[...]` is equivalent to passing them with `axis=1`. You can drop columns named `col1` and `col2` from the DataFrame `df` and save the result to a new DataFrame called `df_drop` like so:

```python
df_drop = df.drop(columns=['col1', 'col2'])
```

After defining an instance of `StandardScaler`, here called `scaler`, you can standardize the data in `my_data` with its `fit_transform` method:

```python
scaler = StandardScaler()
scaler.fit_transform(my_data)
```


In [None]:
# Define our features 
features = ...

# Define our labels
labels = ...

# Import the StandardScaler
...

# Scale the features and set the values to a new variable
scaler = ...
scaled_train_features = ...

In [None]:
# Define our features
features = echo_tracks.drop(columns=['genre_top', 'track_id']) 

# Define our labels
labels = echo_tracks['genre_top']

# Import the StandardScaler
from sklearn.preprocessing import StandardScaler

# Scale the features and set the values to a new variable
scaler = StandardScaler()
scaled_train_features = scaler.fit_transform(features)

In [None]:
%%nose

import sys

def test_dropped_columns():
    assert features.shape == (4802, 8), \
        'Use the .drop method to remove the genre_top and track_id columns.'
        
def test_labels_df():
    assert labels.name == 'genre_top' and labels.shape == (4802, ), \
        'Does your labels Series only contain the genre_top column?'
        
def test_standardscaler_import():
    assert 'sklearn.preprocessing' in list(sys.modules.keys()), \
        'The StandardScaler can be imported from sklearn.preprocessing.'
        
def test_scaled_train_features():
    assert scaled_train_features.shape == (4802, 8) and round(scaled_train_features[0][0], 2) == -0.19, \
        "Use the StandardScaler's fit_transform method to scale your features."

## 4. Principal Component Analysis on our scaled data

Now that we have preprocessed our data, we are ready to use PCA to determine by how much we can reduce the dimensionality of our data. We can use **scree-plots** and **cumulative explained variance plots** to find the number of components to use in further analyses.

Scree-plots display the number of components against the variance explained by each component, sorted in descending order of variance. Scree-plots help us get a better sense of which components explain a sufficient amount of variance in our data. When using scree plots, an 'elbow' (a steep drop from one data point to the next) in the plot is typically used to decide on an appropriate cutoff.
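For intuition about what an elbow looks like in the numbers, the sketch below fits PCA on synthetic data that has one dominant direction of variance by construction (the data and the names `toy`, `toy_pca`, and `ratios` are assumptions for illustration).

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with one dominant direction of variance (made up for illustration).
rng = np.random.RandomState(0)
latent = rng.normal(size=(300, 1))
toy = np.hstack([
    3.0 * latent + rng.normal(scale=0.3, size=(300, 1)),
    2.0 * latent + rng.normal(scale=0.3, size=(300, 1)),
    rng.normal(scale=0.5, size=(300, 1)),
])

toy_pca = PCA().fit(toy)
ratios = toy_pca.explained_variance_ratio_

# The ratios come back sorted from largest to smallest and sum to 1
# when all components are kept.
print(ratios)
```

Here the first ratio dwarfs the rest, so a bar plot of `ratios` would show a sharp drop after the first component. Real data often looks much flatter.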

Use PCA to determine the explained variance of our features.
- Import the `matplotlib.pyplot` module as `plt`, and our `PCA()` class from `sklearn.decomposition`
- Create our PCA class using `PCA()`, fit the model on our `scaled_train_features` using `PCA.fit()`, and retrieve the explained variance ratio
- Make a scree plot of the variance explained by each component

<hr>

We run PCA on all our features at first, which is done by default if `n_components` is not specified.

Helpful links:
- sklearn PCA [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
- matplotlib `bar` plot [documentation](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.bar.html)

After fitting a `PCA` object called `pca` on an array of features named `my_features`, you can get the explained variance ratios and the number of components like this:

```python
pca.fit(my_features)
print(pca.explained_variance_ratio_)
print(pca.n_components_)
```

You can create a barplot by passing first the x-values which hold x-coordinates of each bar and then the corresponding y-values which hold the value of each bar, like so:

```python
fig, ax = plt.subplots()
ax.bar(range(6), [5,1,0,2,3,0])
```

When creating the plot, keep in mind that the number of components goes on the x-axis and the explained variance on the y-axis. 

In [None]:
# This is just to make plots appear in the notebook
%matplotlib inline

# Import our plotting module, and PCA class
#... YOUR CODE ...

# Get our explained variance ratios from PCA using all features
pca = ...
...
exp_variance = ...

# plot the explained variance using a barplot
fig, ax = plt.subplots()
ax.bar(..., ...)
ax.set_xlabel('Principal Component #')

In [None]:
# This is just to make plots appear in the notebook
%matplotlib inline

# Import our plotting module, and PCA class
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Get our explained variance ratios from PCA using all features
pca = PCA()
pca.fit(scaled_train_features)
exp_variance = pca.explained_variance_ratio_

# plot the explained variance using a barplot
fig, ax = plt.subplots()
ax.bar(range(pca.n_components_), exp_variance)
ax.set_xlabel('Principal Component #')

In [None]:
%%nose

import sklearn
import numpy as np
import sys

def test_pca_import():
    assert 'sklearn.decomposition' in list(sys.modules.keys()), \
        'Have you imported the PCA object from sklearn.decomposition?'

def test_pca_obj():
    assert isinstance(pca, sklearn.decomposition.PCA), \
        "Use scikit-learn's PCA() object to create your own PCA object here."
        
def test_exp_variance():
    rounded_array = np.array([0.24, 0.18, 0.14, 0.13, 0.11, 0.08, 0.07, 0.05])
    rounder = lambda t: round(t, ndigits = 2)
    vectorized_round = np.vectorize(rounder)
    assert all(vectorized_round(exp_variance) == rounded_array), \
        'Following the PCA fit, the explained variance ratios can be obtained via the explained_variance_ratio_ attribute.'
        
def test_scree_plot():
    expected_xticks = [float(n) for n in list(range(-1, 9))]
    assert list(ax.get_xticks()) == expected_xticks, \
        'Plot the number of pca components (on the x-axis) against the explained variance (on the y-axis).'

## 5. Further visualization of PCA

Unfortunately, there does not appear to be a clear elbow in this scree plot, which means it is not straightforward to find the number of intrinsic dimensions using this method. 

But all is not lost! Instead, we can also look at the **cumulative explained variance plot** to determine how many features are required to explain, say, about 90% of the variance (cutoffs are somewhat arbitrary here, and usually decided upon by 'rules of thumb'). Once we determine the appropriate number of components, we can perform PCA with that many components, ideally reducing the dimensionality of our data.
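Reading the cutoff off a plot can be error-prone, so it may help to compute it as well. This is a minimal sketch with hypothetical explained variance ratios (not the values from this dataset). It finds the smallest number of components whose cumulative ratio reaches a chosen threshold.

```python
import numpy as np

# Hypothetical explained variance ratios (not the values from this dataset).
toy_ratios = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
toy_cumulative = np.cumsum(toy_ratios)

threshold = 0.90
# argmax returns the first index where the condition is True; add 1 because
# index 0 corresponds to keeping one component.
toy_n_components = int(np.argmax(toy_cumulative >= threshold)) + 1

print(toy_cumulative)
print(toy_n_components)
```

With these made-up ratios, three components reach only 0.85 and the fourth pushes the total past 0.90, so four components are needed.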

Plot the cumulative explained variance of our PCA.
- Import the `numpy` package as `np`.
- Calculate the cumulative sums of our explained variance using `np.cumsum()`.
- Plot the cumulative explained variances using `ax.plot` and look for the number of components at which we can account for about 90% of our variance; assign this to `n_components`.
- Perform PCA using `n_components` and project our data onto these components.

<hr>

Helpful links:
- numpy `cumsum()` function [documentation](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cumsum.html)
- sklearn PCA [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

Don't forget that Python indexing starts at 0 when looking at your plot.

You can use a `PCA` object that has been trained on an array named `my_features` to then project the data onto the components like this:

```python
pca.transform(my_features)
```


In [None]:
# Import numpy
...

# Calculate the cumulative explained variance
cum_exp_variance = ...

# Plot the cumulative explained variance and draw a dashed line at 0.90.
fig, ax = plt.subplots()
...
ax.axhline(y=0.9, linestyle='--')
n_components = ...

# Perform PCA with the chosen number of components and project data onto components
pca = PCA(n_components, random_state=10)
pca.fit(scaled_train_features)
pca_projection = ...

In [None]:
import numpy as np

# Calculate the cumulative explained variance
cum_exp_variance = np.cumsum(exp_variance)

fig, ax = plt.subplots()
ax.plot(cum_exp_variance)
ax.axhline(y=0.9, linestyle='--')
n_components = 6

# Perform PCA with the chosen number of components and project data onto components
pca = PCA(n_components, random_state=10)
pca.fit(scaled_train_features)
pca_projection = pca.transform(scaled_train_features)

In [None]:
%%nose

import sys

def test_np_import():
    assert 'numpy' in list(sys.modules.keys()), \
        'Have you imported numpy?'
    

def test_cumsum():
    cum_exp_variance_correct = np.cumsum(exp_variance)
    assert all(cum_exp_variance == cum_exp_variance_correct), \
    'Use np.cumsum to calculate the cumulative sum of the exp_variance array.'
    
    
def test_n_comp():
    assert n_components == 6, \
    ('Check the values in cum_exp_variance if it is difficult '
    'to determine the number of components from the plot.')
    
    
def test_trans_pca():
    pca_test = PCA(n_components, random_state=10)
    pca_test.fit(scaled_train_features)
    assert (pca_projection == pca_test.transform(scaled_train_features)).all(), \
    'Transform the scaled features and assign them to the pca_projection variable.'

## 6. Train a decision tree to classify genre

Now we can use the lower dimensional PCA projection of the data to classify songs into genres. To do that, we first need to split our dataset into 'train' and 'test' subsets, where the 'train' subset will be used to train our model while the 'test' dataset allows for model performance validation.

Here, we will be using a simple algorithm known as a decision tree. Decision trees are rule-based classifiers that take in features and follow a 'tree structure' of binary decisions to ultimately classify a data point into one of two or more categories. In addition to being easy to both use and interpret, decision trees allow us to visualize the 'logic flowchart' that the model generates from the training data.

Here is an example of a decision tree that demonstrates the process by which an input image (in this case, of a shape) might be classified based on the number of sides it has and whether it is rotated.

<img src="img/simple_decision_tree.png" alt="Decision Tree Flow Chart Example" width="350px"/>

Prepare our training and test sets and train our first classifier.
- Import the `train_test_split()` function from `sklearn.model_selection` module
- Split our projected data into train and test, features and labels, respectively using `train_test_split()`.
- Create our decision tree classifier using `DecisionTreeClassifier()` and a random state of `10` and train the model using the `model.fit()` notation
- Find the predicted labels of the `test_features` from our trained model using the `model.predict()` notation.

<hr>

Helpful links:
- scikit-learn `train_test_split()` function [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- scikit-learn `DecisionTreeClassifier()` class [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

The `train_test_split()` function can be used to split `features` and `labels` as training and test sets like this after importing it from `sklearn.model_selection`:

```python
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
```
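When no `test_size` is given, `train_test_split()` holds out 25% of the rows. The sketch below checks the resulting sizes on placeholder arrays with the same number of rows as our merged dataset (4802); the `toy_` names are hypothetical and the array contents are meaningless.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as our merged dataset (4802).
toy_features = np.arange(4802).reshape(-1, 1)
toy_labels = np.zeros(4802)

toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_features, toy_labels, random_state=10)

# With no test_size given, 25% of the rows are held out for testing.
print(len(toy_train_X), len(toy_test_X))
```

Setting `random_state` makes the shuffle reproducible, which is why the tests in this project can compare your split against a reference split.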

You can fit your decision tree classifier named `tree` and train on your `train_features` and `train_labels` like this:

```python
tree = DecisionTreeClassifier()
tree.fit(train_features, train_labels)
```
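To see the 'logic flowchart' a fitted tree encodes, scikit-learn's `export_text` prints the learned rules. The sketch below uses a made-up shapes dataset in the spirit of the figure above; the features, labels, and `toy_` names are assumptions for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up shapes: [number of sides, is_rotated]; the labels are hypothetical.
toy_X = [[3, 0], [3, 1], [4, 0], [4, 1]]
toy_y = ['triangle', 'triangle', 'square', 'diamond']

toy_tree = DecisionTreeClassifier(random_state=10)
toy_tree.fit(toy_X, toy_y)

# export_text prints the learned if/else rules as an indented text tree.
print(export_text(toy_tree, feature_names=['sides', 'rotated']))
print(toy_tree.predict([[4, 1]]))
```

Each printed branch is a binary threshold test on one feature, and each leaf names the predicted class.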

In [None]:
# Import train_test_split function and Decision tree classifier
# ... YOUR CODE ...

# Split our data
train_features, test_features, train_labels, test_labels = ...

# Train our decision tree
tree = ...
...

# Predict the labels for the test data
pred_labels_tree = ...

In [None]:
# Import train_test_split function and Decision tree classifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split our data
train_features, test_features, train_labels, test_labels = train_test_split(
    pca_projection, labels, random_state=10)

# Train our decision tree
tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)

# Predict the labels for the test data
pred_labels_tree = tree.predict(test_features)

In [None]:
%%nose

import sys

def test_train_test_split_import():
    assert 'sklearn.model_selection' in list(sys.modules.keys()), \
        'Have you imported train_test_split from sklearn.model_selection?'

    
def test_decision_tree_import():
    assert 'sklearn.tree' in list(sys.modules.keys()), \
        'Have you imported DecisionTreeClassifier from sklearn.tree?'
    
    
def test_train_test_split():
    train_test_res = train_test_split(pca_projection, labels, random_state=10)
    assert (train_features == train_test_res[0]).all(), \
        'Did you correctly call the train_test_split function?'
    
    
def test_tree():
    assert tree.get_params() == DecisionTreeClassifier(random_state=10).get_params(), \
        'Did you create the decision tree correctly?'
    
    
def test_tree_fit():
    assert hasattr(tree, 'classes_'), \
        'Did you fit the tree to the training data?'
    
    
def test_tree_pred():
    assert (pred_labels_tree == 'Rock').sum() == 971, \
        'Did you correctly use the fitted tree object to make a prediction from the test features?'

## 7. Compare our decision tree to a logistic regression

Although our tree's performance is decent, it's a bad idea to immediately assume that it's therefore the perfect tool for this job -- there's always the possibility of other models that will perform even better! It's always a worthwhile idea to at least test a few other algorithms and find the one that's best for our data.

Sometimes simplest is best, and so we will start by applying **logistic regression**. Logistic regression uses the logistic function to calculate the probability that a given data point belongs to a given class. Once we have both models, we can compare them on a few performance metrics, such as false positive and false negative rate (or how many points are inaccurately classified). 
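As a rough illustration of the logistic function at work, the sketch below fits a default binary `LogisticRegression` on a made-up one-feature problem and checks that passing the model's linear score through the logistic function reproduces its predicted probabilities. The helper `logistic` and the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A made-up one-feature problem: small values are class 0, large values class 1.
toy_X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_logreg = LogisticRegression(random_state=10).fit(toy_X, toy_y)

# The model's linear score w*x + b, passed through the logistic function,
# gives the predicted probability of the positive class.
scores = toy_logreg.decision_function(toy_X)
by_hand = logistic(scores)
from_model = toy_logreg.predict_proba(toy_X)[:, 1]

print(np.allclose(by_hand, from_model))
```

A score of 0 maps to a probability of exactly 0.5, which is the default decision boundary between the two classes.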

Train our logistic regression and compare the performance with our decision tree.
- Create our logistic regression model using `LogisticRegression()` and a random state of `10`.
- Train the model using the `model.fit()` notation and assign the predicted labels for the `test_features` to `pred_labels_logit`.
- Import the `classification_report` from the `sklearn.metrics` package
- Print the classification reports for our trained Decision Tree and Logistic Regression models. 

<hr>

Helpful links:
- scikit-learn `LogisticRegression()` class [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- scikit-learn `classification_report()` function [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

You can fit your `LogisticRegression` named `my_logreg` on `features` and `labels` like this after instantiating an instance of `LogisticRegression`:

```python
my_logreg = LogisticRegression()
my_logreg.fit(features, labels)
```

You can get the classification report from the labels of each data point called `labels` and the predicted labels called `predicted` like this:

```python
report = classification_report(labels, predicted)
```

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Train our logistic regression and predict labels for the test set
logreg = ...
...
pred_labels_logit = ...

# Create the classification report for both models
from sklearn.metrics import classification_report
class_rep_tree = ...
class_rep_log = ...

print("Decision Tree: \n", class_rep_tree)
print("Logistic Regression: \n", class_rep_log)

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Train our logistic regression and predict labels for the test set
logreg = LogisticRegression(random_state=10)
logreg.fit(train_features, train_labels)
pred_labels_logit = logreg.predict(test_features)

# Create the classification report for both models
from sklearn.metrics import classification_report
class_rep_tree = classification_report(test_labels, pred_labels_tree)
class_rep_log = classification_report(test_labels, pred_labels_logit)

print("Decision Tree: \n", class_rep_tree)
print("Logistic Regression: \n", class_rep_log)

In [None]:
%%nose

def test_logreg():
    assert logreg.get_params() == LogisticRegression(random_state=10).get_params(), \
        'The logreg variable should be created using LogisticRegression().'

    
def test_logreg_pred():
    assert (pred_labels_logit == 'Rock').sum() == 1027, \
        'The labels should be predicted from the test_features.'
    
    
def test_class_rep_tree():
    assert class_rep_tree == ('             precision    recall  f1-score   support'
                              '\n\n    Hip-Hop       0.66      0.66      0.66       229'
                              '\n       Rock       0.92      0.92      0.92       972'
                              '\n\navg / total       0.87      0.87      0.87      1201\n'), \
        'Did you create the classification report correctly for the decision tree?'
    
    
def test_class_rep_log():
    assert class_rep_log == ('             precision    recall  f1-score   support'
                             '\n\n    Hip-Hop       0.75      0.57      0.65       229'
                             '\n       Rock       0.90      0.95      0.93       972'
                             '\n\navg / total       0.87      0.88      0.87      1201\n'), \
        'Did you create the classification report correctly for the logistic regression?'

## 8. Balance our data for greater performance

Both our models do similarly well, boasting an average precision of 87% each. However, looking at our classification report, we can see that rock songs are fairly well classified, but hip-hop songs are disproportionately misclassified as rock songs. 

Why might this be the case? Well, just by looking at the number of data points we have for each class, we see that we have far more data points for the rock classification than for hip-hop, potentially skewing our model's ability to distinguish between classes. This also tells us that most of our model's accuracy is driven by its ability to classify just rock songs, which is less than ideal.
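A quick back-of-the-envelope check shows how much of the accuracy the imbalance alone can explain. The sketch below takes the test-set class counts from the support column of the classification reports (972 rock, 229 hip-hop) and scores a hypothetical "model" that always answers 'Rock'.

```python
# Test-set class counts taken from the support column of the classification reports.
n_rock = 972
n_hip_hop = 229

# A "model" that ignores its input and always answers 'Rock'
# is still right for every rock song in the test set.
always_rock_accuracy = n_rock / (n_rock + n_hip_hop)

print(round(always_rock_accuracy, 3))
```

That trivial baseline is already right about 81% of the time, so an overall score in the high 80s says little about how well hip-hop is being recognized.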

To account for this, we can weight the value of a correct classification in each class inversely to the occurrence of data points for each class. Since a correct classification for "Rock" is not more important than a correct classification for "Hip-Hop" (and vice versa), we only need to account for differences in _sample size_ of our data points when weighting our classes here, and not relative importance of each class. 

Balance our dataset such that the number of tracks for each genre is the same.
- Subset only the hip-hop tracks from `echo_tracks` using `df.loc[]`, and the same for the rock tracks
- Sample `rock_only` such that there is the same number of data points as there are hip-hop data points. Set the random state to 10.
- Concatenate the `rock_only` and `hop_only` (in that order) DataFrames using the `pd.concat()` function by passing a list of these DataFrames.
- Redefine our train and test sets using `train_test_split` with the PCA projection of the balanced dataframe.

<hr>

Helpful links:
- pandas `DataFrame.loc[]` indexing [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)
- pandas `concat()` function [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html)
- pandas `DataFrame.sample()` method [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html)

You can filter rows of a dataframe named `df` which satisfy a boolean statement where the value in a column named `col1` is greater than 10, and then sample only 20 of these rows like this:

```python
df_filter_sample = df.loc[df['col1'] > 10].sample(20, random_state=0)
```

You can concatenate two DataFrames named `df1` and `df2` like this:

```python
df_concat = pd.concat([df1, df2])
```
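Putting the two hints together, here is a minimal sketch of downsampling the majority class on a made-up frame (six rock rows, two hip-hop rows). The `toy_` names and values are hypothetical.

```python
import pandas as pd

# A made-up, imbalanced frame: six rock tracks and two hip-hop tracks.
toy = pd.DataFrame({
    'track_id': range(8),
    'genre_top': ['Rock'] * 6 + ['Hip-Hop'] * 2,
})

toy_hop = toy.loc[toy['genre_top'] == 'Hip-Hop']
# Draw as many rock rows (without replacement) as there are hip-hop rows.
toy_rock = toy.loc[toy['genre_top'] == 'Rock'].sample(len(toy_hop), random_state=10)

toy_balanced = pd.concat([toy_rock, toy_hop])
print(toy_balanced['genre_top'].value_counts())
```

The cost of this approach is visible even here: four of the six rock rows are thrown away to reach a 50/50 split.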


In [None]:
# Subset only the hip-hop tracks, and then only the rock tracks
hop_only = ...
rock_only = ...

# sample the rock songs to be the same number as there are hip-hop songs
rock_only = ...

# concatenate the dataframes rock_only and hop_only
rock_hop_bal = ...

# The features, labels, and pca projection are created for the balanced dataframe
features = rock_hop_bal.drop(['genre_top', 'track_id'], axis=1) 
labels = rock_hop_bal['genre_top']
pca_projection = pca.fit_transform(scaler.fit_transform(features))

# Redefine the train and test set with the pca_projection from the balanced data
train_features, test_features, train_labels, test_labels = train_test_split(..., random_state=10)

In [None]:
# Subset only the hip-hop tracks
hop_only = echo_tracks.loc[echo_tracks['genre_top'] == 'Hip-Hop']

# subset only the rock songs, and take a sample the same size as there are hip-hop songs
rock_only = echo_tracks.loc[echo_tracks['genre_top'] == 'Rock'].sample(hop_only.shape[0], random_state=10)

# concatenate the dataframes rock_only and hop_only
rock_hop_bal = pd.concat([rock_only, hop_only])

# The features, labels, and pca projection are created for the balanced dataframe
features = rock_hop_bal.drop(['genre_top', 'track_id'], axis=1) 
labels = rock_hop_bal['genre_top']
pca_projection = pca.fit_transform(scaler.fit_transform(features))

# Redefine the train and test set with the pca_projection from the balanced data
train_features, test_features, train_labels, test_labels = train_test_split(
    pca_projection, labels, random_state=10)

In [None]:
%%nose

def test_hop_only():
    try:
        pd.testing.assert_frame_equal(hop_only, echo_tracks.loc[echo_tracks['genre_top'] == 'Hip-Hop'])
    except AssertionError:
        assert False, "The hop_only data frame was not assigned correctly."
        

def test_rock_only():
    try:
        pd.testing.assert_frame_equal(
            rock_only, echo_tracks.loc[echo_tracks['genre_top'] == 'Rock'].sample(hop_only.shape[0], random_state=10))
    except AssertionError:
        assert False, "The rock_only data frame was not assigned correctly."
        
        
def test_rock_hop_bal():
    hop_only = echo_tracks.loc[echo_tracks['genre_top'] == 'Hip-Hop']
    rock_only = echo_tracks.loc[echo_tracks['genre_top'] == 'Rock'].sample(hop_only.shape[0], random_state=10)
    try:
        pd.testing.assert_frame_equal(
            rock_hop_bal, pd.concat([rock_only, hop_only]))
    except AssertionError:
        assert False, "The rock_hop_bal data frame was not assigned correctly."
        
        
def test_train_features():
    assert round(train_features[0][0], 4) == -0.7311 and round(train_features[-1][-1], 4) == 0.5624, \
    'The train_test_split was not performed correctly.'

## 9. Does balancing our dataset improve model bias?

We've now balanced our dataset, but in doing so, we've removed a lot of data points that might have been crucial to training our models. Let's test whether balancing our data reduces the models' bias towards the "Rock" classification while retaining overall classification performance.
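
As a rough illustration of that trade-off, the sketch below downsamples a small made-up DataFrame in the same way and counts how many rows are discarded. The names `toy`, `hop_toy`, `rock_toy`, and `toy_bal` are hypothetical and are not part of the project data.

```python
import pandas as pd

# Hypothetical toy data: 6 'Rock' rows and 2 'Hip-Hop' rows
toy = pd.DataFrame({
    'track_id': range(8),
    'genre_top': ['Rock'] * 6 + ['Hip-Hop'] * 2,
})

# Keep every row of the minority class
hop_toy = toy.loc[toy['genre_top'] == 'Hip-Hop']

# Downsample the majority class to the size of the minority class
rock_toy = toy.loc[toy['genre_top'] == 'Rock'].sample(hop_toy.shape[0], random_state=10)

# Stack the two subsets into one balanced DataFrame
toy_bal = pd.concat([rock_toy, hop_toy])

balanced_counts = toy_bal['genre_top'].value_counts()
rows_dropped = len(toy) - len(toy_bal)

print(balanced_counts)
print("Rows dropped:", rows_dropped)
```

Both classes end up with the same number of rows, at the cost of throwing away most of the majority class.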

Note that balancing has already reduced the size of our dataset. The train and test sets defined above still use the PCA projection, now fit on the balanced data. In practice, we would consider dimensionality reduction more rigorously when dealing with very large datasets and when computation times become prohibitively long.

Compare the two model performances on the balanced data.
- Create and train your decision tree using `DecisionTreeClassifier()` and a random state of `10`, then predict on the `test_features`.
- Create and train your logistic regression using `LogisticRegression()` and a random state of `10`, then predict on the `test_features`.
- Compare the performance of the two models using `classification_report()`. 

<hr>


After fitting a scikit-learn model called `my_model` on the training data (`train_features` and `train_labels`), you can predict labels for features called `my_features` like this:

```python
my_model.fit(train_features, train_labels)
predicted = my_model.predict(my_features)
```
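
When comparing the reports, one way to look for bias towards a class is to compare the per-class recall. The sketch below uses made-up labels (`true_labels` and `pred_labels` are hypothetical, not project output) for a model that leans towards predicting "Rock".

```python
from sklearn.metrics import classification_report, recall_score

# Hypothetical labels: four songs of each genre
true_labels = ['Rock', 'Rock', 'Rock', 'Rock',
               'Hip-Hop', 'Hip-Hop', 'Hip-Hop', 'Hip-Hop']

# Hypothetical predictions: two Hip-Hop songs are mislabeled as Rock
pred_labels = ['Rock', 'Rock', 'Rock', 'Rock',
               'Rock', 'Rock', 'Hip-Hop', 'Hip-Hop']

# Full per-class report of precision, recall, and f1-score
print(classification_report(true_labels, pred_labels))

# Per-class recall, returned in the order given by `labels`
hop_recall, rock_recall = recall_score(
    true_labels, pred_labels, labels=['Hip-Hop', 'Rock'], average=None)

print("Hip-Hop recall:", hop_recall, "Rock recall:", rock_recall)
```

A large gap between the two recall values suggests the model favors one class. Similar values suggest the bias has been reduced.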

In [None]:
# Train our decision tree on the balanced data
tree = ...
...
pred_labels_tree = ...

# Train our logistic regression on the balanced data
logreg = ...
...
pred_labels_logit = ...

# Compare the models
print("Decision Tree: \n", classification_report(...))
print("Logistic Regression: \n", classification_report(...))

In [None]:
# Train our decision tree on the balanced data
tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)
pred_labels_tree = tree.predict(test_features)

# Train our logistic regression on the balanced data
logreg = LogisticRegression(random_state=10)
logreg.fit(train_features, train_labels)
pred_labels_logit = logreg.predict(test_features)

# compare the models
print("Decision Tree: \n", classification_report(test_labels, pred_labels_tree))
print("Logistic Regression: \n", classification_report(test_labels, pred_labels_logit))

In [None]:
%%nose

def test_tree_bal():
    assert (pred_labels_tree == 'Rock').sum() == 226, \
    'The pred_labels_tree variable should contain the predicted labels from the test_features.'
    
    
def test_logit_bal():
    assert (pred_labels_logit == 'Rock').sum() == 221, \
    'The pred_labels_logit variable should contain the predicted labels from the test_features.'

## 10. Using cross-validation to evaluate our models

Success! Balancing our data has removed bias towards the more prevalent class. To get a good sense of how well our models are actually performing, we can apply what's called **cross-validation** (CV). This step allows us to compare models in a more rigorous fashion.

Since the way our data is split into train and test sets can impact model performance, CV attempts to split the data multiple ways and test the model on each of the splits. Although there are many different CV methods, all with their own advantages and disadvantages, we will use what's known as **K-fold** CV here. K-fold first splits the data into K different subsets of (approximately) equal size. Then, it uses each subset in turn as the test set while training on the remainder of the data. Finally, we aggregate the results from each fold into a final model performance score.
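
To make the splitting concrete, here is a minimal sketch on a tiny made-up array (`toy_X` and `kf_toy` are hypothetical names) that prints which row indices land in the train and test sets for each fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# 6 hypothetical samples with 2 features each
toy_X = np.arange(12).reshape(6, 2)

# 3 folds without shuffling, so each test fold is a contiguous block of rows
kf_toy = KFold(n_splits=3)

test_folds = []
for train_idx, test_idx in kf_toy.split(toy_X):
    print("train:", train_idx, "test:", test_idx)
    test_folds.append(test_idx.tolist())
```

Each sample appears in exactly one test fold and in the training set of every other fold.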

Use cross-validation to get a better sense of your model performance.
- Create a variable called `kf` to store your cv using `KFold()` with `10` folds and a random state of `10`.
- Train each of your models using `cross_val_score()`.
- Print the mean of the cross-validation scores for each model using `np.mean()`.

<hr>

Helpful links:
- `KFold()` [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
- `cross_val_score()` [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score)

You can train and cross-validate a scikit-learn model named `my_model` after defining our KFold cross-validation named `kf` like this:

```python
kf = KFold(5)
cv_score = cross_val_score(my_model, features, labels, cv=kf)
```
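
`cross_val_score()` returns one score per fold, which is why the results are averaged afterwards. The sketch below shows this on synthetic data from `make_classification`. The names `X_demo`, `y_demo`, `kf_demo`, and `demo_scores` are hypothetical stand-ins for the project variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data standing in for the features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=10)

# 5-fold cross-validation
kf_demo = KFold(n_splits=5)

# One accuracy score per fold (accuracy is the default scorer for classifiers)
demo_scores = cross_val_score(
    LogisticRegression(random_state=10), X_demo, y_demo, cv=kf_demo)

print("Per-fold accuracy:", demo_scores)
print("Mean accuracy:", np.mean(demo_scores))
```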

In [None]:
from sklearn.model_selection import KFold, cross_val_score

# Set up our K-fold cross-validation
kf = ...

tree = DecisionTreeClassifier(random_state=10)
logreg = LogisticRegression(random_state=10)

# Train our models using KFold cv
tree_score = ...
logit_score = ...

# Print the mean of each array of scores
print("Decision Tree:", ..., "Logistic Regression:", ...)

In [None]:
from sklearn.model_selection import KFold, cross_val_score

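# Set up our K-fold cross-validation
# Note: newer scikit-learn releases raise a ValueError when random_state is set
# without shuffle=True; the test below assumes the older, unshuffled behavior
kf = KFold(10, random_state=10)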

tree = DecisionTreeClassifier(random_state=10)
logreg = LogisticRegression(random_state=10)

# Train our models using KFold cv
tree_score = cross_val_score(tree, pca_projection, labels, cv=kf)
logit_score = cross_val_score(logreg, pca_projection, labels, cv=kf)

# Print the mean of each array of scores
print("Decision Tree:", np.mean(tree_score), "Logistic Regression:", np.mean(logit_score))

In [None]:
%%nose

def test_kf():
    assert kf.__repr__() == 'KFold(n_splits=10, random_state=10, shuffle=False)', \
    'The k-fold cross-validation was not set up correctly.'
    
    
def test_tree_score():
    assert round((tree_score.sum() / tree_score.shape[0]), 4) == 0.7242, \
    'The tree_score was not calculated correctly.'
    
    
def test_log_score():
    assert round((logit_score.sum() / logit_score.shape[0]), 4) == 0.7753, \
    'The logit_score was not calculated correctly.'
