# Standardizing Stock Data

In this activity, you’ll use scikit-learn’s `StandardScaler` class and the K-means algorithm to cluster stocks. The goal of clustering the stocks is to inform a portfolio investment strategy.

## Instructions

1. Read in the `tsx-energy-2018.csv` file from the `Resources` folder and create the DataFrame. Make sure to set the `Ticker` column as the DataFrame’s index. Then review the DataFrame.

    > **Note** The stock data provided for this activity contains the yearly mean prices (open, high, low, and close), volume, annual return, and annual variance for energy-sector companies listed on the TSX.

2. To prepare the data, use the `StandardScaler` class and its `fit_transform` method to scale all the columns containing numerical values. Review a five-row sample of the scaled data using bracket notation (`[0:5]`).

3. Create a new DataFrame called `df_stocks_scaled` that contains the scaled data. Make sure to do the following: 

    - Use the same labels that were referenced in the `StandardScaler` for the column names. 
    - Add a column to the DataFrame that consists of the tickers from the original DataFrame. (Hint: This column was the index). 

    - Set the new column of tickers as the index for the new DataFrame. 

    - Review the resulting DataFrame. 

4. Encode the `EnergyType` column using `pd.get_dummies`, and save the result in a separate DataFrame called `df_oil_dummies`. Note that, because the company name isn’t relevant for clustering, you don’t need to encode the `CompanyName` column.

5. Using the `pd.concat` function, concatenate the `df_stocks_scaled` DataFrame with the `df_oil_dummies` DataFrame, along an axis value of 1 (`axis=1` tells Pandas to join the data horizontally by columns). Review the concatenated DataFrame. 

6. Using the concatenated DataFrame, cluster the data by using the K-means algorithm and a k value of 3. Create a copy of the concatenated DataFrame, and add the resulting list of company segment values as a new column. 



## References

[scikit-learn StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

[scikit-learn Preprocessing Data](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-scaler)

[Pandas concat function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)

[scikit-learn Python Library](https://scikit-learn.org)

In [None]:
# Import the required libraries and dependencies
import pandas as pd
from pathlib import Path
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

## Step 1: Read in the `tsx-energy-2018.csv` file from the `Resources` folder and create the DataFrame. Make sure to set the `Ticker` column as the DataFrame’s index. Then review the DataFrame.

In [None]:
# Read the CSV file into a Pandas DataFrame
# Set the index using the Ticker column
df_stocks = # YOUR CODE HERE

# Review the DataFrame
# YOUR CODE HERE
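
As a rough illustration of Step 1, the sketch below reads a small in-memory CSV and sets `Ticker` as the index through the `index_col` argument of `pd.read_csv`. The rows, tickers, and the `MeanClose` and `Volume` column names are made up for the example; they are not taken from `tsx-energy-2018.csv`.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for the activity's CSV file (illustrative values only)
csv_text = StringIO(
    "Ticker,CompanyName,MeanClose,Volume,EnergyType\n"
    "AAA,Alpha Energy,10.5,1000,Oil\n"
    "BBB,Beta Power,20.0,2500,Renewable\n"
    "CCC,Gamma Fuels,15.25,1800,Oil\n"
)

# index_col="Ticker" makes the Ticker column the DataFrame's index
df_example = pd.read_csv(csv_text, index_col="Ticker")

# Review the first rows
print(df_example.head())
```

With the real data, the first argument would be a file path built with `Path`, for example `Path("Resources/tsx-energy-2018.csv")` if the notebook sits next to the `Resources` folder (that relative location is an assumption).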

## Step 2: To prepare the data, use the `StandardScaler` class and its `fit_transform` method to scale all the columns containing numerical values. Review a five-row sample of the scaled data using bracket notation (`[0:5]`).

In [None]:
# Use the StandardScaler class and its fit_transform method to
# scale all columns with numerical values
stock_data_scaled = # YOUR CODE HERE

# Display the first five rows of the scaled data
# YOUR CODE HERE
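
One possible way to approach the scaling step is sketched below on a tiny made-up DataFrame. The column names `MeanClose` and `Volume` are hypothetical placeholders for the numerical columns in the real file. `fit_transform` returns a NumPy array in which each column has a mean of 0 and a standard deviation of 1.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data; the numeric column names are assumptions for illustration
df_example = pd.DataFrame(
    {
        "MeanClose": [10.0, 20.0, 30.0],
        "Volume": [1000.0, 3000.0, 2000.0],
        "EnergyType": ["Oil", "Renewable", "Oil"],
    },
    index=pd.Index(["AAA", "BBB", "CCC"], name="Ticker"),
)

# Scale only the numerical columns; text columns are left out
numeric_cols = ["MeanClose", "Volume"]
scaled = StandardScaler().fit_transform(df_example[numeric_cols])

# The result is a NumPy array, so bracket slicing shows the first rows
print(scaled[0:5])
```

Because the result is a plain array, the column labels and the ticker index are lost at this point, which is why the next step rebuilds a DataFrame from it.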

## Step 3: Create a new DataFrame called `df_stocks_scaled` that contains the scaled data. Make sure to do the following:

- Use the same labels that were referenced in the `StandardScaler` for the column names. 
    
- Add a column to the DataFrame that consists of the tickers from the original DataFrame. (Hint: This column was the index). 

- Set the new column of tickers as the index for the new DataFrame. 

- Review the resulting DataFrame. 


In [None]:
# Create a DataFrame with the scaled data
# The column names should match those referenced in the StandardScaler step
df_stocks_scaled = # YOUR CODE HERE

# Create a Ticker column in the df_stocks_scaled DataFrame
# using the index of the original df_stocks DataFrame
df_stocks_scaled["Ticker"] = # YOUR CODE HERE

# Set the newly created Ticker column as the index of the df_stocks_scaled DataFrame
df_stocks_scaled = # YOUR CODE HERE

# Review the DataFrame
# YOUR CODE HERE
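
The bullet points in Step 3 can be sketched as follows, again on made-up data with hypothetical column names (`MeanClose`, `Volume`). The scaled array is wrapped in a DataFrame that reuses the column labels passed to the scaler, the tickers are copied over from the original index, and that column then becomes the new index.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data; the numeric column names are assumptions for illustration
df_example = pd.DataFrame(
    {"MeanClose": [10.0, 20.0, 30.0], "Volume": [1000.0, 3000.0, 2000.0]},
    index=pd.Index(["AAA", "BBB", "CCC"], name="Ticker"),
)
numeric_cols = ["MeanClose", "Volume"]
scaled = StandardScaler().fit_transform(df_example[numeric_cols])

# Wrap the array in a DataFrame, reusing the labels given to the scaler
df_scaled_example = pd.DataFrame(scaled, columns=numeric_cols)

# Copy the tickers from the original index (assigned by position)
df_scaled_example["Ticker"] = df_example.index

# Make the Ticker column the index of the scaled DataFrame
df_scaled_example = df_scaled_example.set_index("Ticker")

print(df_scaled_example)
```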

## Step 4: Encode the `EnergyType` column using `pd.get_dummies`, and save the result in a separate DataFrame called `df_oil_dummies`. Note that, because the company name isn’t relevant for clustering, you don’t need to encode the `CompanyName` column.

In [None]:
# Encode (convert to dummy variables) the EnergyType column
df_oil_dummies = # YOUR CODE HERE

# Review the DataFrame
# YOUR CODE HERE
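
A minimal sketch of the encoding step, assuming made-up energy types (`Oil`, `Renewable`) that may not match the categories in the real file. `pd.get_dummies` creates one column per distinct category and keeps the original index. Passing `dtype=float` is an optional choice made here so that the dummy columns have the same numeric type as the scaled columns.

```python
import pandas as pd

# Made-up data; the categories are illustrative only
df_example = pd.DataFrame(
    {"EnergyType": ["Oil", "Renewable", "Oil"]},
    index=pd.Index(["AAA", "BBB", "CCC"], name="Ticker"),
)

# One column per category, with 1.0 marking the category of each row
df_dummies_example = pd.get_dummies(df_example["EnergyType"], dtype=float)

print(df_dummies_example)
```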

## Step 5: Using the `pd.concat` function, concatenate the `df_stocks_scaled` DataFrame with the `df_oil_dummies` DataFrame, along an axis value of 1 (`axis=1` tells Pandas to join the data horizontally by columns). Review the concatenated DataFrame. 


In [None]:
# Concatenate the `EnergyType` encoded dummies with the scaled data DataFrame
df_stocks_scaled = # YOUR CODE HERE

# Display the sample data
# YOUR CODE HERE
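
The concatenation can be sketched as below, using two small made-up DataFrames that share a `Ticker` index. With `axis=1`, `pd.concat` places the columns side by side and aligns the rows on the index labels, which is why both DataFrames need the tickers as their index. All names and values here are hypothetical.

```python
import pandas as pd

tickers = pd.Index(["AAA", "BBB", "CCC"], name="Ticker")

# Made-up scaled values and dummy columns sharing the same index
df_scaled_example = pd.DataFrame(
    {"MeanClose": [-1.2, 0.0, 1.2], "Volume": [-1.2, 1.2, 0.0]},
    index=tickers,
)
df_dummies_example = pd.DataFrame(
    {"Oil": [1.0, 0.0, 1.0], "Renewable": [0.0, 1.0, 0.0]},
    index=tickers,
)

# axis=1 joins horizontally: the rows stay, the columns are combined
df_combined_example = pd.concat([df_scaled_example, df_dummies_example], axis=1)

print(df_combined_example)
```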

## Step 6: Using the concatenated DataFrame, cluster the data by using the K-means algorithm and a k value of 3. Create a copy of the concatenated DataFrame, and add the resulting list of company segment values as a new column. 

In [None]:
# Initialize the K-Means model with n_clusters=3
model = # YOUR CODE HERE

In [None]:
# Fit the model to the df_stocks_scaled DataFrame
# YOUR CODE HERE

In [None]:
# Predict the model segments (clusters)
stock_clusters = # YOUR CODE HERE

# View the stock segments
# YOUR CODE HERE

In [None]:
# Create a copy of the concatenated DataFrame
df_stocks_scaled_predictions = # YOUR CODE HERE

In [None]:
# Create a new column in the copy of the concatenated DataFrame with the predicted clusters
df_stocks_scaled_predictions["StockCluster"] = # YOUR CODE HERE

# Review the DataFrame
# YOUR CODE HERE
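
To illustrate the full Step 6 workflow, the sketch below runs K-means with `n_clusters=3` on six made-up points that form three obvious groups. The feature names `x` and `y`, the tickers, and the `random_state` and `n_init` settings are choices made for this example, not requirements of the activity.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up features: three well-separated pairs of points
df_features_example = pd.DataFrame(
    {
        "x": [0.0, 0.1, 5.0, 5.1, 10.0, 10.0],
        "y": [0.0, 0.0, 5.0, 5.0, 0.0, 0.1],
    },
    index=pd.Index(["AAA", "BBB", "CCC", "DDD", "EEE", "FFF"], name="Ticker"),
)

# Initialize the model; random_state makes repeated runs reproducible
model_example = KMeans(n_clusters=3, n_init=10, random_state=1)

# Fit the model, then assign each row to a cluster
model_example.fit(df_features_example)
clusters_example = model_example.predict(df_features_example)

# Work on a copy so the feature DataFrame stays unchanged
df_predictions_example = df_features_example.copy()
df_predictions_example["StockCluster"] = clusters_example

print(df_predictions_example)
```

The cluster numbers themselves (0, 1, 2) are arbitrary labels; only the grouping of the rows carries meaning.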