# Transforming Data into Features

You are a data scientist at a clothing company and are working with a data set of customer reviews. This dataset is originally from Kaggle and has a lot of potential for various machine learning purposes. You are tasked with transforming some of these features to make the data more useful for analysis. To do this, you will have time to practice the following:

Transforming categorical data
Scaling your data
Working with date-time features
Let’s get started!

Tasks
Basic Exploration
1.
Let’s start with some basic exploring by performing the following:

First, import your dataset. It is stored under a file named reviews.csv. Save it to a variable called reviews.

2.
Next, we want to look at the column names of our dataset along with their data types. Do the following two steps:

Print the column names of your dataset.
Check your features’ data types by printing .info().
Data Transformations
3.
Transform the recommended feature. Start by printing the feature’s .value_counts().

4.
Since this is a True/False feature, we want to transform it to 1 for True and 0 for False.

To do this, create a dictionary called binary_dict where:

The keys are what is currently in the recommended feature.
The values are what we want in the new column (0s and 1s).

5.
Using binary_dict, transform the recommended column so that it will now be binary. Print the results using .value_counts() to confirm the transformation.

6.
Let’s run through a similar process to transform the rating feature. This is ordinal data so our transformation should make that more clear. Again, start by printing the .value_counts().


7.
We want to make the following changes to the values:

‘Loved it’ → 5
‘Liked it’ → 4
‘Was okay’ → 3
‘Not great’ → 2
‘Hated it’ → 1
Create a dictionary called rating_dict where the keys are what is currently in the feature and the values are what we want in the new column. You can use the hierarchy listed above to make your dictionary.

8.
Using rating_dict, transform the rating column so it contains numerical values. Print the results using .value_counts() to confirm the transformation.
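One thing worth checking after a dictionary-based `.map()` is whether any value was missing from the dictionary, since pandas fills unmapped values with NaN rather than raising an error. A minimal sketch on made-up data (the `toy` frame and its 'Meh' value are illustrative assumptions, not taken from reviews.csv):

```python
import pandas as pd

# Toy data; 'Meh' is deliberately absent from the mapping below.
toy = pd.DataFrame({'rating': ['Loved it', 'Hated it', 'Meh', 'Was okay']})

rating_dict = {'Loved it': 5, 'Liked it': 4, 'Was okay': 3,
               'Not great': 2, 'Hated it': 1}

# .map() looks each value up in the dictionary; unmapped values become NaN.
mapped = toy['rating'].map(rating_dict)

# Counting NaNs is a quick way to catch typos or unexpected categories.
n_unmapped = int(mapped.isna().sum())
print(mapped.tolist())
print("Unmapped values:", n_unmapped)
```

If the count is above zero, comparing `.value_counts()` before and after the mapping usually shows which category was missed.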

9.
Let’s now transform the department_name feature. This process will be slightly different, but start by printing the .value_counts() of the feature.

Use pandas’ get_dummies() to one-hot encode our feature.
Attach the results back to our original data frame.
Print the column names to see!
10.
Use the pandas get_dummies() function to one-hot encode our feature. Assign the result to a variable called one_hot.

11.
Join the results from one_hot back to our original data frame. Then print out the column names. What has been added?
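To see what the join adds, here is a small sketch on a toy frame. The department names are made up for illustration, and `dtype=int` is an optional choice so the indicators show as 0/1 rather than booleans:

```python
import pandas as pd

# Toy frame; the department names here are illustrative only.
toy = pd.DataFrame({'clothing_id': [1, 2, 3],
                    'department_name': ['Tops', 'Dresses', 'Tops']})

# One indicator column per category, in sorted category order.
one_hot = pd.get_dummies(toy['department_name'], dtype=int)

# join() aligns on the index, so each row keeps its own indicators.
toy = toy.join(one_hot)
print(toy.columns.tolist())
print(toy)
```

Each original row now has exactly one indicator set to 1, in the column named after its department.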

12.
Let’s make one more feature transformation!

Transform the review_date feature.

This feature is listed as an object type, but we want this to be transformed into a date-time feature.

Transform review_date into a date-time feature.
Print the feature type to confirm the transformation.
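A small sketch of the conversion on toy data may help. The dates and the derived `review_month` column are assumptions for illustration; the point is that once the column is a date-time type, the `.dt` accessor exposes parts such as month or weekday:

```python
import pandas as pd

# Toy dates stored as plain strings, as they would be after reading a CSV.
toy = pd.DataFrame({'review_date': ['2019-01-15', '2019-03-02']})
print(toy['review_date'].dtype)

# Parse the strings into datetime64 values.
toy['review_date'] = pd.to_datetime(toy['review_date'])
print(toy['review_date'].dtype)

# Date parts are now available through the .dt accessor.
toy['review_month'] = toy['review_date'].dt.month
print(toy)
```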

Scaling the Data
13.
The final step we will take in our transformation project is scaling our data. We notice that we have a wide range of numbers thus far, so it is best to put everything on the same scale.

Let’s reduce our data frame to only the numerical features we created.

14.
Reset the index to be our clothing_id feature.

15.
We are ready to scale our data! Perform a .fit_transform() on our data set, and print the results to see how the features have changed.
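The task does not name a scaler. Assuming min-max scaling (scikit-learn's `MinMaxScaler`, one of several possible choices), each column is rescaled to the range 0 to 1 using (x - min) / (max - min). The sketch below checks that formula against the library on a toy array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two columns on different scales.
X = np.array([[1.0, 0.0],
              [3.0, 1.0],
              [5.0, 1.0]])

# fit_transform() learns each column's min and max, then rescales.
scaled = MinMaxScaler().fit_transform(X)

# The same computation by hand, column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled)
print(np.allclose(scaled, manual))
```

Columns that are already 0/1, such as one-hot indicators, are left unchanged by this kind of scaling.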

16.
Congratulations!

You have successfully completed this transformation project. Transformations are an incredibly valuable skill to have. Great job!

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# 1. Load the dataset from 'reviews.csv' into a DataFrame called reviews
reviews = pd.read_csv('reviews.csv')

# 2. Print column names and data types
print("Column Names:", reviews.columns.tolist())
print("\nData Types:")
reviews.info()  # info() prints its report directly and returns None

# 3. Print value counts of the 'recommended' feature
print("\nRecommended Value Counts:")
print(reviews['recommended'].value_counts())

# 4. Create a dictionary to map True/False to 1/0
binary_dict = {True: 1, False: 0}

# 5. Apply the binary_dict to transform 'recommended' and confirm
reviews['recommended'] = reviews['recommended'].map(binary_dict)
print("\nTransformed Recommended Value Counts:")
print(reviews['recommended'].value_counts())

# 6. Print value counts of the 'rating' feature
print("\nRating Value Counts:")
print(reviews['rating'].value_counts())

# 7. Create a dictionary to map rating text to ordinal values
rating_dict = {
    'Loved it': 5,
    'Liked it': 4,
    'Was okay': 3,
    'Not great': 2,
    'Hated it': 1
}

# 8. Apply the rating_dict to transform 'rating' and confirm
reviews['rating'] = reviews['rating'].map(rating_dict)
print("\nTransformed Rating Value Counts:")
print(reviews['rating'].value_counts())

# 9. Print value counts of 'department_name' before one-hot encoding
print("\nDepartment Name Value Counts:")
print(reviews['department_name'].value_counts())

# 10. One-hot encode 'department_name' and store in one_hot
one_hot = pd.get_dummies(reviews['department_name'], prefix='dept')

# 11. Join one_hot back to reviews and print new column names
reviews = reviews.join(one_hot)
print("\nUpdated Column Names After One-Hot Encoding:")
print(reviews.columns)

# 12. Convert 'review_date' to datetime format and confirm
reviews['review_date'] = pd.to_datetime(reviews['review_date'])
print("\nReview Date Data Type After Conversion:")
print(reviews['review_date'].dtype)

# 13. Select only numerical features for scaling (copy so we do not modify a view of reviews)
numerical_features = reviews[['clothing_id', 'recommended', 'rating'] + list(one_hot.columns)].copy()

# 14. Set the index to 'clothing_id'
numerical_features = numerical_features.set_index('clothing_id')

# 15. Scale the data using MinMaxScaler and print results
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(numerical_features)
print("\nScaled Data Sample:")
print(pd.DataFrame(scaled_data,
                   columns=numerical_features.columns,
                   index=numerical_features.index).head())


# Supervised Learning I : Regressors, Classifiers and Trees

Honey Production
Now that you have learned how linear regression works, let’s try it on an example of real-world data.

As you may have already heard, the honeybees are in a precarious state right now. You may have seen articles about the decline of the honeybee population for various reasons. You want to investigate this decline and how the trends of the past predict the future for the honeybees.

Note: All the tasks can be completed using Pandas or NumPy. Pick whichever one you prefer.


Tasks
Check out the Data
1.
We have loaded in a DataFrame for you about honey production in the United States from Kaggle. It is called df and has the following columns:

state
numcol
yieldpercol
totalprod
stocks
priceperlb
prodvalue
year
Use .head() to get a sense of how this DataFrame is structured.

2.
For now, we care about the total production of honey per year. Use the .groupby() method provided by pandas to get the mean of totalprod per year.

Store this in a variable called prod_per_year.
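As a rough illustration of what the grouping produces, the sketch below uses a tiny made-up frame (the numbers are not from the honey dataset). Calling `.reset_index()` is one way to keep `year` as an ordinary column, which the next task relies on:

```python
import pandas as pd

# Toy stand-in for df: two states per year, made-up production numbers.
toy = pd.DataFrame({'year': [1998, 1998, 1999, 1999],
                    'totalprod': [100.0, 300.0, 50.0, 150.0]})

# Mean of totalprod within each year; reset_index() turns the
# group key back into a regular 'year' column.
prod_per_year = toy.groupby('year')['totalprod'].mean().reset_index()
print(prod_per_year)
```

Without `.reset_index()`, the result is a Series indexed by year, so the years would have to be read from the index instead of a column.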

3.
Create a variable called X that is the column of years in this prod_per_year DataFrame.

After creating X, we will need to reshape it to get it into the right format, using this command:

X = X.values.reshape(-1, 1)


4.
Create a variable called y that is the totalprod column in the prod_per_year dataset.

5.
Using plt.scatter(), plot y vs X as a scatterplot.

Display the plot using plt.show().

Can you see a vaguely linear relationship between these variables?

Create and Fit a Linear Regression Model
6.
Create a linear regression model from scikit-learn and call it regr.

Use the LinearRegression() constructor from the linear_model module to do this.

7.
Fit the model to the data by using .fit(). You can feed X into your regr model by passing it in as a parameter of .fit().

8.
After you have fit the model, print out the slope of the line (stored in an array called regr.coef_) and the intercept of the line (regr.intercept_).
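To make the roles of `coef_` and `intercept_` concrete, here is a sketch on synthetic points that lie exactly on y = 2x + 1 (the data is an assumption chosen so the answer is known in advance). It also checks that the model's predictions are just slope times x plus intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic points lying exactly on y = 2x + 1.
X = np.array([0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y = 2 * X.ravel() + 1

regr = LinearRegression().fit(X, y)

# coef_ holds one slope per feature; with one feature, take element 0.
slope, intercept = regr.coef_[0], regr.intercept_
print("slope:", slope, "intercept:", intercept)

# predict() gives the same values as computing the line by hand.
by_hand = slope * X.ravel() + intercept
print(np.allclose(by_hand, regr.predict(X)))
```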

9.
Create a list called y_predict that is the predictions your regr model would make on the X data.

10.
Plot y_predict vs X as a line, on top of your scatterplot using plt.plot().

Make sure to call plt.show() after plotting the line.

Predict the Honey Decline
11.
So, it looks like the production of honey has been in decline, according to this linear model. Let’s predict what the year 2050 may look like in terms of honey production.

Our known dataset stops at the year 2013, so let’s create a NumPy array called X_future that is the range from 2013 to 2050. The code below makes a NumPy array with the numbers 1 through 10:

nums = np.array(range(1, 11))


After creating that array, we need to reshape it for scikit-learn.

X_future = X_future.reshape(-1, 1)


You can think of reshape() as rotating this array. Rather than one big row of numbers, X_future is now a big column of numbers — there’s one number in each row.

reshape() is a little tricky! It might help to print out X_future before and after reshaping.
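Printing the shape before and after makes the change easy to see. This sketch reuses the `nums` example from above rather than the real year range:

```python
import numpy as np

nums = np.array(range(1, 11))
print(nums.shape)        # one-dimensional: a single row of 10 numbers

# -1 tells NumPy to work out the number of rows; 1 fixes a single column.
nums_col = nums.reshape(-1, 1)
print(nums_col.shape)    # two-dimensional: 10 rows, 1 column
print(nums_col[:3])
```

scikit-learn expects this two-dimensional layout for feature data: one row per sample and one column per feature.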

12.
Create a list called future_predict that is the y-values that your regr model would predict for the values of X_future.

13.
Plot future_predict vs X_future on a different plot.

How much honey will be produced in the year 2050, according to this?


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# 1. Load the dataset from the provided URL
df = pd.read_csv("https://content.codecademy.com/programs/data-science-path/linear_regression/honeyproduction.csv")

# 2. Group by year and calculate the mean of total production
prod_per_year = df.groupby('year')['totalprod'].mean().reset_index()

# 3. Extract the 'year' column and reshape it for regression
X = prod_per_year['year'].values.reshape(-1, 1)

# 4. Extract the 'totalprod' column as the target variable
y = prod_per_year['totalprod'].values

# 5. Plot the data to visualize the relationship
plt.scatter(X, y)
plt.xlabel('Year')
plt.ylabel('Total Honey Production')
plt.title('Honey Production Over Time')
plt.show()

# 6. Create a linear regression model
regr = LinearRegression()

# 7. Fit the model to the data
regr.fit(X, y)

# 8. Print the slope and intercept of the regression line
print("Slope (Coefficient):", regr.coef_[0])
print("Intercept:", regr.intercept_)

# 9. Predict y values using the trained model
y_predict = regr.predict(X)

# 10. Plot the regression line over the scatterplot
plt.scatter(X, y, label='Actual')
plt.plot(X, y_predict, color='red', label='Regression Line')
plt.xlabel('Year')
plt.ylabel('Total Honey Production')
plt.title('Honey Production with Linear Regression')
plt.legend()
plt.show()

# 11. Create future years from 2013 to 2050 and reshape
X_future = np.array(range(2013, 2051)).reshape(-1, 1)

# 12. Predict future honey production
future_predict = regr.predict(X_future)

# 13. Plot future predictions
plt.plot(X_future, future_predict, color='green')
plt.xlabel('Year')
plt.ylabel('Predicted Honey Production')
plt.title('Predicted Honey Production (2013–2050)')
plt.show()

# Print predicted production for 2050
print("Predicted honey production in 2050:", future_predict[-1])


Tennis Ace
Overview
This project contains a series of open-ended requirements which describe the project you’ll be building. There are many possible ways to correctly fulfill all of these requirements, and you should expect to use the internet, Codecademy, and other resources when you encounter a problem.

Project Goals
You will create a linear regression model that predicts the outcome for a tennis player based on their playing habits. By analyzing and modeling the Association of Tennis Professionals (ATP) data, you will determine what it takes to be one of the best tennis players in the world.

Setup Instructions
If you choose to do this project on your computer instead of Codecademy, you can download what you’ll need by clicking the “Download” button below. If you need help setting up your computer, be sure to check out our setup guide.

Tasks
Prerequisites
1.
In order to complete this project, you should have completed the Linear Regression and Multiple Linear Regression lessons in the Machine Learning Course. This content is also covered in the Data Scientist Career Path.

Project Requirements
2.
“Game, Set, Match!”

No three words are sweeter to hear as a tennis player than those, which indicate that a player has beaten their opponent. While you can head down to your nearest court and aim to overcome your challenger across the net without much practice, a league of professionals spends day and night, month after month practicing to be among the best in the world. Today you will put your linear regression knowledge to the test to better understand what it takes to be an all-star tennis player.

Provided in tennis_stats.csv is data from the men’s professional tennis league, which is called the ATP (Association of Tennis Professionals). Data from the top 1500 ranked players in the ATP over the span of 2009 to 2017 are provided in the file. The statistics recorded for each player in each year include service game (offensive) statistics, return game (defensive) statistics and outcomes. Load the csv into a DataFrame and investigate it to gain familiarity with the data.

Open the hint for more information about each column of the dataset.

3.
Perform exploratory analysis on the data by plotting different features against the different outcomes. What relationships do you find between the features and outcomes? Do any of the features seem to predict the outcomes?

4.
Use one feature from the dataset to build a single feature linear regression model on the data. Your model, at this point, should use only one feature and predict one of the outcome columns. Before training the model, split your data into training and test datasets so that you can evaluate your model on the test set. How does your model perform? Plot your model’s predictions on the test set against the actual outcome variable to visualize the performance.
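One possible shape for this step is sketched below. Because the column list of tennis_stats.csv is not reproduced in these notes, the sketch uses synthetic data with placeholder column names (`aces`, `winnings`); swap in a real feature and outcome column when working with the actual file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for tennis_stats.csv; column names are placeholders.
rng = np.random.default_rng(0)
aces = rng.uniform(0, 500, size=200)
winnings = 1000 * aces + rng.normal(0, 20000, size=200)
df = pd.DataFrame({'aces': aces, 'winnings': winnings})

X = df[['aces']]     # double brackets keep X two-dimensional
y = df['winnings']

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# score() returns R^2 on the held-out rows.
test_r2 = model.score(X_test, y_test)
y_pred = model.predict(X_test)
print("Test R^2:", round(test_r2, 3))
```

Plotting `y_pred` against `y_test` with `plt.scatter()` then gives the visual check the task asks for: points close to a straight diagonal indicate good predictions.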

5.
Create a few more linear regression models that use one feature to predict one of the outcomes. Which model that you create is the best?

6.
Create a few linear regression models that use two features to predict yearly earnings. Which set of two features results in the best model?

7.
Create a few linear regression models that use multiple features to predict yearly earnings. Which set of features results in the best model?
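Comparing feature sets by hand gets tedious, so a loop can help. The following is only a sketch on synthetic data: `feat_a`, `feat_b`, `feat_c` and `outcome` are hypothetical names, and the outcome is constructed to depend on the first two features only.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: outcome depends on feat_a and feat_b; feat_c is noise.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    'feat_a': rng.normal(size=n),
    'feat_b': rng.normal(size=n),
    'feat_c': rng.normal(size=n),
})
df['outcome'] = 3 * df['feat_a'] + 2 * df['feat_b'] + rng.normal(scale=0.5, size=n)

train, test = train_test_split(df, test_size=0.2, random_state=1)

# Fit one model per pair of features and record its test R^2.
scores = {}
for pair in combinations(['feat_a', 'feat_b', 'feat_c'], 2):
    cols = list(pair)
    model = LinearRegression().fit(train[cols], train['outcome'])
    scores[pair] = model.score(test[cols], test['outcome'])

best_pair = max(scores, key=scores.get)
print(scores)
print("Best pair:", best_pair)
```

The same loop works for larger feature sets by changing the second argument of `combinations`, though the number of models grows quickly.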

Head to the Codecademy forums and share your set of features that resulted in the highest test score for predicting your outcome. What features are most important for being a successful tennis player?

Solution
8.
Great work! Visit our forums to compare your project to our sample solution code. You can also learn how to host your own solution on GitHub so you can share it with other learners! Your solution might look different from ours, and that’s okay! There are multiple ways to solve these projects, and you’ll learn more by seeing others’ code.

# Tennis Ace => see download

Predict Credit Card Fraud
Credit card fraud is one of the leading causes of identity theft around the world. In 2018 alone, over $24 billion was stolen through fraudulent credit card transactions. Financial institutions employ a wide variety of techniques to prevent fraud, one of the most common being Logistic Regression.

In this project, you are a Data Scientist working for a credit card company. You have access to a dataset (based on a synthetic financial dataset), that represents a typical set of credit card transactions. transactions.csv is the original dataset containing 200k transactions. For starters, we’re going to be working with a small portion of this dataset, transactions_modified.csv, which contains one thousand transactions. Your task is to use Logistic Regression and create a predictive model to determine if a transaction is fraudulent or not.

Note that a solution.py file is loaded for you in the workspace, which contains solution code for this project. We highly recommend that you complete the project on your own without checking the solution, but feel free to take a look if you get stuck or want to check your answers when you’re done!

Tasks
Load the Data
1.
The file transactions_modified.csv contains data on 1000 simulated credit card transactions. Let’s begin by loading the data into a pandas DataFrame named transactions. Take a peek at the dataset using .head(), and use .info() to examine how many rows there are and what the datatypes are. How many transactions are fraudulent? Print your answer.

Clean the Data
2.
Looking at the dataset, combined with our knowledge of credit card transactions in general, we can see that there are a few interesting columns to look at. We know that the amount of a given transaction is going to be important. Calculate summary statistics for this column. What does the distribution look like?

3.
We have a lot of information about the type of transaction we are looking at. Let’s create a new column called isPayment that assigns a 1 when type is “PAYMENT” or “DEBIT”, and a 0 otherwise.

4.
Similarly, create a column called isMovement, which will capture if money moved out of the origin account. This column will have a value of 1 when type is either “CASH_OUT” or “TRANSFER”, and a 0 otherwise.

5.
With financial fraud, another key factor to investigate would be the difference in value between the origin and destination account. Our theory, in this case, being that destination accounts with a significantly different value could be suspect of fraud. Let’s create a column called accountDiff with the absolute difference of the oldbalanceOrg and oldbalanceDest columns.

Select and Split the Data
6.
Before we can start training our model, we need to define our features and label columns. Our label column in this dataset is the isFraud field. Create a variable called features which will be an array consisting of the following fields:

amount
isPayment
isMovement
accountDiff
Also create a variable called label with the column isFraud.

7.
Split the data into training and test sets using sklearn‘s train_test_split() method. We’ll use the training set to train the model and the test set to evaluate the model. Use a test_size value of 0.3.

Normalize the Data
8.
Since sklearn‘s Logistic Regression implementation uses Regularization, we need to scale our feature data. Create a StandardScaler object, .fit_transform() it on the training features, and .transform() the test features.
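The reason for calling `.fit_transform()` on the training features but only `.transform()` on the test features is that the mean and standard deviation should come from the training data alone. A minimal sketch with toy numbers (not the transaction data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[20.0], [40.0]])

scaler = StandardScaler()

# fit_transform() learns the training mean and standard deviation.
X_train_scaled = scaler.fit_transform(X_train)

# transform() reuses those training statistics on the test data.
X_test_scaled = scaler.transform(X_test)

print("Training mean:", scaler.mean_)
print(X_test_scaled)
```

The test value 20.0 maps to 0 because it equals the training mean, even though the test set's own mean is 30.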

Create and Evaluate the Model
9.
Create a LogisticRegression model with sklearn and .fit() it on the training data.

Fitting the model finds the best coefficients for our selected features so it can more accurately predict our label. We will start with the default threshold of 0.5.

10.
Run the model’s .score() method on the training data and print the training score.

Scoring the model on the training data will process the training data through the trained model and will predict which transactions are fraudulent. The score returned is the percentage of correct classifications, or the accuracy.

11.
Run the model’s .score() method on the test data and print the test score.

Scoring the model on the test data will process the test data through the trained model and will predict which transactions are fraudulent. The score returned is the percentage of correct classifications, or the accuracy, and will be an indicator for the success of your model.

How did your model perform?

12.
Print the coefficients for our model to see how important each feature column was for prediction. Which feature was most important? Least important?

Predict With the Model
13.
Let’s use our model to process more transactions that have gone through our systems. There are three numpy arrays pre-loaded in the workspace with information on new sample transactions under “New transaction data”:

# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])


Create a fourth array, your_transaction, and add any transaction information you’d like. Make sure to enter all values as floats with a .!

14.
Combine the new transactions and your_transaction into a single numpy array called sample_transactions.

15.
Since our Logistic Regression model was trained on scaled feature data, we must also scale the feature data we are making predictions on. Using the StandardScaler object created earlier, apply its .transform() method to sample_transactions and save the result to sample_transactions.

16.
Which transactions are fraudulent? Use your model’s .predict() method on sample_transactions and print the result to find out.

Want to see the probabilities that led to these predictions? Call your model’s .predict_proba() method on sample_transactions and print the result. The 1st column is the probability of a transaction not being fraudulent, and the 2nd column is the probability of a transaction being fraudulent (which was calculated by our model to make the final classification decision).
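The link between `.predict_proba()` and `.predict()` can be checked on a tiny synthetic problem. The one-feature data below is an assumption for illustration only; for a binary model, the predicted class is 1 exactly when the second column exceeds 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic one-feature data: larger values are labelled 1.
X = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

samples = np.array([[-1.2], [0.1], [1.8]])

# Columns follow model.classes_, here [0, 1].
proba = model.predict_proba(samples)
pred = model.predict(samples)

# Thresholding the class-1 probability at 0.5 reproduces predict().
manual = (proba[:, 1] > 0.5).astype(int)

print(proba)
print(pred, manual)
```

Each row of `proba` sums to 1, so the first column is simply one minus the second.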

17.
Congratulations on completing the project!

Note that we’d used a modified version of the dataset. You can now try to re-run the project using the original dataset, transactions.csv. Examine how the results change. If you notice something weird, you’re totally on to something! That “something” is what is known as an imbalanced class classification problem.

We will cover this very relevant topic (among many other things) in the Logistic Regression II module!

In [None]:
# Imports used by this script (unused and Codecademy-only imports removed)
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data
transactions = pd.read_csv('transactions_modified.csv')
print(transactions.head())
transactions.info()  # info() prints its report directly and returns None

# How many fraudulent transactions? (assumes isFraud is coded 0/1)
print(transactions['isFraud'].sum())

# Summary statistics on amount column
print(transactions['amount'].describe())

# Create isPayment field: 1 for PAYMENT or DEBIT, 0 otherwise
transactions['isPayment'] = transactions['type'].isin(['PAYMENT', 'DEBIT']).astype(int)

# Create isMovement field: 1 for CASH_OUT or TRANSFER, 0 otherwise
transactions['isMovement'] = transactions['type'].isin(['CASH_OUT', 'TRANSFER']).astype(int)

# Create accountDiff field
transactions['accountDiff'] = abs(transactions['oldbalanceDest'] - transactions['oldbalanceOrg'])

# Create features and label variables
features = transactions[['amount','isPayment','isMovement','accountDiff']]
label = transactions['isFraud']

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    label, 
                                                    test_size=0.3)

# Normalize the features variables
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit the model to the training data
model = LogisticRegression()
model.fit(X_train, y_train)

# Score the model on the training data
print(model.score(X_train, y_train))

# Score the model on the test data
print(model.score(X_test, y_test))

# Print the model coefficients
print(model.coef_)

# New transaction data
transaction1 = np.array([123456.78, 0.0, 1.0, 54670.1])
transaction2 = np.array([98765.43, 1.0, 0.0, 8524.75])
transaction3 = np.array([543678.31, 1.0, 0.0, 510025.5])

# Create a new transaction
your_transaction = np.array([6472.54, 1.0, 0.0, 55901.23])

# Combine new transactions into a single array
sample_transactions = np.stack((transaction1,transaction2,transaction3,your_transaction))

# Normalize the new transactions
sample_transactions = scaler.transform(sample_transactions)

# Predict fraud on the new transactions
print(model.predict(sample_transactions))

# Show probabilities on the new transactions
print(model.predict_proba(sample_transactions))

Cancer Classifier
In this project, we will be using several Python libraries to make a K-Nearest Neighbor classifier that is trained to predict whether a patient has breast cancer.


Tasks
Explore the data
1.
Let’s begin by importing the breast cancer data from sklearn. We want to import the function load_breast_cancer from sklearn.datasets.

Once we’ve imported the dataset, let’s load the data into a variable called breast_cancer_data. Do this by setting breast_cancer_data equal to the function load_breast_cancer().

2.
Before jumping into creating our classifier, let’s take a look at the data. Begin by printing breast_cancer_data.data[0]. That’s the first datapoint in our set. But what do all of those numbers represent? Let’s also print breast_cancer_data.feature_names.

3.
We now have a sense of what the data looks like, but what are we trying to classify? Let’s print both breast_cancer_data.target and breast_cancer_data.target_names.

Was the very first data point tagged as malignant or benign?

Splitting the data into Training and Validation Sets
4.
We have our data, but now it needs to be split into training and validation sets. Luckily, sklearn has a function that does that for us. Begin by importing the train_test_split function from sklearn.model_selection.

5.
Call the train_test_split function. It takes several parameters:

The data you want to split (for us breast_cancer_data.data)
The labels associated with that data (for us, breast_cancer_data.target).
The test_size. This is what percentage of your data you want to be in your testing set. Let’s use test_size = 0.2
random_state. This will ensure that every time you run your code, the data is split in the same way. This can be any number. We used random_state = 100.
6.
Right now we’re not storing the return value of train_test_split. train_test_split returns four values in the following order:

The training set
The validation set
The training labels
The validation labels
Store those values in variables named training_data, validation_data, training_labels, and validation_labels.
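The return order is easy to mix up, so here is a small sketch on toy arrays (made-up data, with the same `test_size` and `random_state` as above). It also checks that each row stays paired with its own label after the shuffle:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy rows with 2 features each; row i is [2i, 2i + 1] and has label i.
data = np.arange(20).reshape(10, 2)
labels = np.arange(10)

# Order of the return values: data splits first, then label splits.
training_data, validation_data, training_labels, validation_labels = train_test_split(
    data, labels, test_size=0.2, random_state=100)

print(len(training_data), len(validation_data))

# Each training row still lines up with its original label.
print((training_data[:, 0] == 2 * training_labels).all())
```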

7.
Let’s confirm that worked correctly. Print out the length of training_data and training_labels. They should be the same size - one label for every piece of data!

Running the classifier
8.
Now that we’ve created training and validation sets, we can create a KNeighborsClassifier and test its accuracy. Begin by importing KNeighborsClassifier from sklearn.neighbors.

9.
Create a KNeighborsClassifier where n_neighbors = 3. Name the classifier classifier.

10.
Train your classifier using the fit function. This function takes two parameters: the training set and the training labels.

11.
Now that the classifier has been trained, let’s find how accurate it is on the validation set. Call the classifier’s score function. score takes two parameters: the validation set and the validation labels. Print the result!

12.
The classifier does pretty well when k = 3. But maybe there’s a better k! Put the previous 3 lines of code inside a for loop. The loop should have a variable named k that starts at 1 and increases to 100. Rather than n_neighbors always being 3, it should be this new variable k.

You should now see 100 different validation accuracies print out. Which k seems the best?

Graphing the results
13.
We now have the validation accuracy for 100 different ks. Rather than just printing it out, let’s make a graph using matplotlib. Begin by importing matplotlib.pyplot as plt.

14.
The x-axis should be the values of k that we tested. This should be a list of numbers between 1 and 100. You can use the range function to make this list. Store it in a variable named k_list.

15.
The y-axis of our graph should be the validation accuracy. Instead of printing the validation accuracies, we want to add them to a list. Outside of the for loop, create an empty list named accuracies. Inside the for loop, instead of printing each accuracy, append it to accuracies.

16.
We can now plot our data! Call plt.plot(). The first parameter should be k_list and the second parameter should be accuracies.

After plotting the graph, show it using plt.show().

17.
Let’s add some labels and a title. Set the x-axis label to "k" using plt.xlabel(). Set the y-axis label to "Validation Accuracy". Set the title to "Breast Cancer Classifier Accuracy".

18.
Great work! If you want to play around with this more, try changing the random_state parameter when making the training set and validation set. This will change which points are in the training set and which are in the validation set.

Ideally, the graph will look the same no matter how you split up the training set and test set. This data set is fairly small, so there is slightly more variance than usual.
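No solution cell for this project appears in these notes, so the tasks above can be sketched roughly as follows. This is one possible reading of the steps, not the official solution, and the `best_k` variable is an addition for convenience.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Tasks 1-3: load the data and look at what it contains.
breast_cancer_data = load_breast_cancer()
print(breast_cancer_data.feature_names)
print(breast_cancer_data.target_names)

# Tasks 4-7: split into training and validation sets.
training_data, validation_data, training_labels, validation_labels = train_test_split(
    breast_cancer_data.data, breast_cancer_data.target,
    test_size=0.2, random_state=100)
print(len(training_data), len(training_labels))

# Tasks 8-15: fit one classifier per k and collect validation accuracies.
k_list = range(1, 101)
accuracies = []
for k in k_list:
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(training_data, training_labels)
    accuracies.append(classifier.score(validation_data, validation_labels))

# Report the k with the highest validation accuracy (first one on ties).
best_k = k_list[accuracies.index(max(accuracies))]
print("Best k:", best_k, "accuracy:", round(max(accuracies), 3))
```

From here, `plt.plot(k_list, accuracies)` with the axis labels and title from task 17 would produce the graph described above.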

In [None]:
# Imports used by this script (unused and Codecademy-only imports removed)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

#https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data
cols = ['name','landmass','zone', 'area', 'population', 'language','religion','bars','stripes','colours',
'red','green','blue','gold','white','black','orange','mainhue','circles',
'crosses','saltires','quarters','sunstars','crescent','triangle','icon','animate','text','topleft','botright']
df= pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/flags/flag.data", names = cols)

#variable names to use as predictors
var = [ 'red', 'green', 'blue','gold', 'white', 'black', 'orange', 'mainhue','bars','stripes', 'circles','crosses', 'saltires','quarters','sunstars','triangle','animate']

#Print number of countries by landmass, or continent
print(df.landmass.value_counts())

#Create a new dataframe with only flags from Europe and Oceania
df_36 = df[df["landmass"].isin([3,6])]

#Print the average values of the numeric predictors for Europe and Oceania
#(numeric_only=True skips the categorical 'mainhue' column)
print(df_36.groupby('landmass')[var].mean(numeric_only=True).T)

#Create labels for only Europe and Oceania
labels = df_36["landmass"]

#Print the variable types for the predictors
print(df[var].dtypes)

#Create dummy variables for categorical predictors
data = pd.get_dummies(df_36[var])

#Split data into a train and test set
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state=1, test_size=.4)

#Fit a decision tree for max_depth values 1-20; save the accuracy score in acc_depth
depths = range(1, 21)
acc_depth = []
for i in depths:
    dt = DecisionTreeClassifier(random_state = 10, max_depth = i)
    dt.fit(train_data, train_labels)
    acc_depth.append(dt.score(test_data, test_labels))

#Plot the accuracy vs depth
plt.plot(depths, acc_depth)
plt.xlabel('max_depth')
plt.ylabel('accuracy')
plt.show()

#Find the largest accuracy and the depth this occurs
max_acc = np.max(acc_depth)
best_depth = depths[np.argmax(acc_depth)]
#Multiply before rounding to avoid floating-point noise in the printed percentage
print(f'Highest accuracy {round(max_acc*100, 1)}% at depth {best_depth}')

#Refit decision tree model with the highest accuracy and plot the decision tree
plt.figure(figsize=(14,8))
dt = DecisionTreeClassifier(random_state = 1, max_depth = best_depth)
dt.fit(train_data, train_labels)
#list() passes plain Python lists, which some scikit-learn versions require here
tree.plot_tree(dt, feature_names = list(train_data.columns),
               class_names = ['Europe', 'Oceania'],
               filled=True)
plt.show()

#Create a new list for the accuracy values of a pruned decision tree.  Loop through
#the values of ccp and append the scores to the list
acc_pruned = []
ccp = np.logspace(-3, 0, num=20)
for i in ccp:
    dt_prune = DecisionTreeClassifier(random_state = 1, max_depth = best_depth, ccp_alpha=i)
    dt_prune.fit(train_data, train_labels)
    acc_pruned.append(dt_prune.score(test_data, test_labels))

plt.plot(ccp, acc_pruned)
plt.xscale('log')
plt.xlabel('ccp_alpha')
plt.ylabel('accuracy')
plt.show()

#Find the largest accuracy and the ccp value this occurs
max_acc_pruned = np.max(acc_pruned)
best_ccp = ccp[np.argmax(acc_pruned)]

#Multiply before rounding to avoid floating-point noise in the printed percentage
print(f'Highest accuracy {round(max_acc_pruned*100, 1)}% at ccp_alpha {round(best_ccp,4)}')

#Fit a decision tree model with the values for max_depth and ccp_alpha found above
dt_final = DecisionTreeClassifier(random_state = 1, max_depth = best_depth, ccp_alpha=best_ccp)
dt_final.fit(train_data, train_labels)

#Plot the final decision tree
plt.figure(figsize=(14,8))
#list() passes plain Python lists, which some scikit-learn versions require here
tree.plot_tree(dt_final, feature_names = list(train_data.columns),
               class_names = ['Europe', 'Oceania'],
               filled=True)
plt.show()
