In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math

# Task 1: Guessing the name of a color by Bob

For this task we will need to:
- input several colors in RGB format
- clean our data set
- remove duplicate RGB values in the data
- train a KNN model
- use n-fold cross validation to evaluate our model
- find which k value results in the most accurate KNN model
- produce a performance report with accuracy and visualizations


To do the above we will combine our own functions with library routines.

First, we create a function that allows the user (Alice) to input N sets of RGB values.
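RGB channels are normally limited to 0-255, while a bare `int(input(...))` accepts any integer. As a minimal sketch of a range check that could sit in front of such input, the helper below parses one comma-separated line. The name `parse_rgb` and the "R,G,B" text format are assumptions made for illustration and are not part of the assignment.

```python
def parse_rgb(text):
    """Parse 'R,G,B' into a list of three ints, each in 0..255 (hypothetical helper)."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected three comma-separated values, got {len(parts)}")
    values = [int(p) for p in parts]  # int() raises ValueError on non-numeric text
    for v in values:
        if not 0 <= v <= 255:
            raise ValueError(f"channel value {v} is outside 0-255")
    return values


print(parse_rgb("255, 128, 0"))
```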

In [None]:
'''
A function is used here for ease of reusability
'''
def colour_inputs(n):
    '''
    Inputs:
     - n : the number of rgb sets that will be entered
    
    Outputs:
     - A list of all the rgb values in the format [[R1,G1,B1], ..., [Rn,Gn,Bn]]
     
    Uses a for loop from 0 to n which asks the user to respectively input 
    the R,G,B values. At the end of each loop iteration it will add the values 
    to the Output list
    '''
    
    RGB_Inputs = [] # initializes an empty list
    
    for i in range(n):
        R = int(input("Enter the R value for the color"))
        G = int(input("Enter the G value for the color"))
        B = int(input("Enter the B value for the color"))
        RGB_Inputs.append([R,G,B]) #adds the values to the list
    
    return RGB_Inputs

After this function we must import our dataset and prep it for analysis

In [None]:
'''
Use the pd.read_csv method to read in the file. skiprows = 3 skips the 
first 3 lines, which are comments detailing the file. header = 1 then takes
the second of the remaining (non-blank) lines as the column names, as it 
details what each column means
'''
colors = pd.read_csv('colour_naming_data-1.csv', skiprows = 3, 
                    header =1)

In [None]:
#quickly inspect the file to make sure it is read in correctly
colors.head(20)

In [None]:
colors.tail(20)

From this quick inspection, I can see that punctuation needs to be removed and that capitalization needs to be standardized, so I will create a clean function.
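As a minimal sketch of the normalization meant here, the snippet below applies it to a few made-up labels using only the standard library. The labels and the helper name `normalise_label` are illustrative assumptions. Collapsing repeated spaces is an extra step I added so that "pink!" and "pink" end up as the same label.

```python
import re


def normalise_label(label):
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", label.lower())
    # split() with no arguments drops leading, trailing and repeated whitespace
    return " ".join(no_punct.split())


raw_labels = ["  Pink!", "light-blue", "PURPLE.", "dark  green"]  # made-up examples
print([normalise_label(s) for s in raw_labels])
```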

In [None]:
def clean(data, labels):
    '''
    Input: Dataframe name and a list of all column names
    Output: cleaned dataframe
    Works by looping through each column. For all columns it replaces
    punctuation with a space through the use of regex operators. If the
    column is of type string, it then lowers all the words and strips any
    excess space, including space left behind by removed punctuation.
    '''
    #Loops through the columns
    for i in labels:
        #Replaces punctuation with a space (raw string so \w and \s are passed to the regex unchanged)
        data[i] = data[i].replace(r'[^\w\s]', ' ', regex=True)
        
        #if statement to check if the column is of type string
        if data[i].dtype == 'object':
            #uses a lambda function to lower and strip all the values
            data[i] = data[i].apply(lambda x: x.lower().strip())
    
    return data

In [None]:
cleaned_colors = clean(colors, ['sample_id', 'colour_name', 'R', 'G', 'B'])

In [None]:
#Quickly reinspect the data with the new cleaned df
cleaned_colors.head(20)

In [None]:
cleaned_colors.tail(20)

Since our data is now cleaned, we must remove all duplicate RGB values and assign each remaining RGB value the color name most commonly associated with it.
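To make the idea concrete, here is a small plain-Python sketch of the same majority vote on made-up rows: group the names by RGB value, then keep the most frequent name in each group. The rows are invented for illustration, and ties may be broken differently than pandas would break them.

```python
from collections import Counter, defaultdict

rows = [  # made-up (R, G, B, name) samples
    (255, 0, 0, "red"),
    (255, 0, 0, "red"),
    (255, 0, 0, "scarlet"),
    (0, 0, 255, "blue"),
]

# Collect every name that was given to each distinct RGB value
names_by_rgb = defaultdict(list)
for r, g, b, name in rows:
    names_by_rgb[(r, g, b)].append(name)

# Keep one row per RGB value, labelled with its most common name
deduplicated = {
    rgb: Counter(names).most_common(1)[0][0]
    for rgb, names in names_by_rgb.items()
}
print(deduplicated)
```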

In [None]:
'''
To do so we will create a new data frame that is grouped by RGB values,
then we will use the value_counts and idxmax functions to find what color
is most common for that rgb value. Finally by using reset_index we put the 
indices back to default

These functions were found by looking through the pandas documentation 
and seeing how they affect a dataframe. 
'''
#first uses the groupby function to arrange the original dataframe by RGB
colors_df = cleaned_colors.groupby(['R','G','B'])

#Then uses a lambda function to find what color name is most used for that rgb value
colors_df = colors_df['colour_name'].apply(lambda x: 
                                        x.value_counts().idxmax())

#finally resets the indices 
colors_df = colors_df.reset_index()

In [None]:
colors_df

Now our data is fully ready to train a model. I also notice that some color names repeat several times, which leads me to predict that our trained model will be slightly biased and will over-select these repeated colors.

We will create a function that builds a kNN model, evaluates it using n-fold cross validation (here leave-one-out, where n equals the number of data points), and finally returns some performance metrics.
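Before relying on a library, it can help to see the whole procedure in miniature. The sketch below is a plain-Python illustration of kNN with leave-one-out evaluation on made-up points. It is not how scikit-learn implements it, the function names are my own, and tie-breaking may differ from the library's.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority label among the k training points nearest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(x, y, k):
    """Hold out each point once, train on the rest, and score the predictions."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_x, train_y, x[i], k) == y[i]:
            correct += 1
    return correct / len(x)


# Made-up data: two tight clusters of three points each
x = [(250, 10, 10), (240, 20, 20), (255, 0, 0),
     (10, 10, 250), (20, 20, 240), (0, 0, 255)]
y = ["red", "red", "red", "blue", "blue", "blue"]

for k in (1, 3, 5):
    print(k, leave_one_out_accuracy(x, y, k))
```

With k = 5 every held-out point is outvoted by the three points of the other cluster, which is a small-scale version of how a large k can favor the more common labels.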

In [None]:
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn import metrics

In [None]:
def training(n, x, y):
    '''
    Input: 
     - n: the number of neighbors for the kNN model
     - x: the x-value from our dataset (rgb_representations)
     - y: the y-value from our dataset (color_names)
     
     Output: 
     - predictions
     - accuracy score
     - weighted precision 
     - weighted recall
     - weighted f1
     
     This function works by first creating the kNN model with k = n. Then, using the LeaveOneOut cross-validator,
     we hold out each data point in turn and train our knn model on all the remaining points. We then use the 
     held-out X_Test point to predict which color_name is associated with its rgb values. Finally we use the 
     predictions generated by the knn and the sklearn metrics library to generate values for accuracy, precision, recall, and f1. 
    '''
   
    knn = KNeighborsClassifier(n_neighbors = n) # initializes the kNN model with k=n
    loo = LeaveOneOut() # initializes the LeaveOneOut cross-validator
    
    #creates empty variable to later store the values generated from the classification/training
    predictions = []
    actual_labels = []
    ac=0
    wp=0
    wr=0
    wf=0
    
    #for loop to go through every-single data point in x
    for train_index, test_index in loo.split(x):
        #Assigns a training set and test set
        X_Train, X_Test = x[train_index], x[test_index]
        Y_Train, Y_Test = y[train_index], y[test_index]
        
        #uses the training set to train the knn model
        knn.fit(X_Train, Y_Train)
        
        #then uses the x_test variable to generate predictions on a value for Y
        predicted = knn.predict(X_Test)
        
        #adds the prediction values to 'predictions' and the actual values (y-values) to 'actual_labels'
        predictions.extend(predicted)
        actual_labels.extend(Y_Test)
    
   
    #uses the sklearn.metrics library to generate performance metrics
    ac = metrics.accuracy_score(actual_labels, predictions)
    #the average is weighted by class support to account for the different color classes;
    #zero_division = 0.0 sets any 0/0 that arises (for example, for a color that is never
    #predicted) to 0 instead of raising a warning
    wp = metrics.precision_score(actual_labels,predictions,
                                 average = 'weighted', zero_division = 0.0)
    
    wr = metrics.recall_score(actual_labels, predictions, 
                              average = 'weighted', zero_division = 0.0)
    
    wf = metrics.f1_score(actual_labels, predictions, 
                          average = 'weighted', zero_division = 0.0)
    
    return predictions,ac, wp, wr, wf

# Hyper-parameter tuning and generating a performance report

We are going to test our kNN model for all k-values from 1-12 and compare all the metrics that are generated as a result to find which model is most accurate.
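To show what the weighted metrics compared here measure, the sketch below computes support-weighted precision and recall by hand on made-up labels. The function name is my own, and treating 0/0 as 0 mirrors the `zero_division = 0.0` setting used in the training function.

```python
from collections import Counter


def weighted_precision_recall(actual, predicted):
    """Support-weighted precision and recall, with 0/0 treated as 0."""
    support = Counter(actual)              # how often each label truly occurs
    predicted_count = Counter(predicted)   # how often each label was predicted
    true_pos = Counter(a for a, p in zip(actual, predicted) if a == p)
    total = len(actual)

    precision = 0.0
    recall = 0.0
    for label, n in support.items():
        p = true_pos[label] / predicted_count[label] if predicted_count[label] else 0.0
        r = true_pos[label] / n
        precision += (n / total) * p
        recall += (n / total) * r
    return precision, recall


actual = ["pink", "pink", "pink", "mauve"]      # made-up labels
predicted = ["pink", "pink", "pink", "pink"]
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(weighted_precision_recall(actual, predicted), accuracy)
```

When every sample has exactly one label, support-weighted recall works out to the same number as accuracy, so those two curves should coincide, while weighted precision can move independently.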

In [None]:
#Creates empty lists to store all the metrics/values that are generated
predictions = []
accuracy = []
weighted_precision = []
weighted_recall = []
weighted_f1 = []

#converts the RGB values and the color names to numpy arrays so they can be run through the function
x = colors_df[['R', 'G', 'B']].to_numpy()
y = colors_df['colour_name'].to_numpy()

#Runs a for loop from 1-12 
for i in range(1,13):
    #runs the training function
    p, ac, wp, wr, wf = training(i,x,y) 
    #adds all the generated values to the lists
    predictions.append(p)
    accuracy.append(ac)
    weighted_precision.append(wp)
    weighted_recall.append(wr)
    weighted_f1.append(wf)
    
    #Generates a small report on each value
    print(f"\nResults for K: {i}\nAccuracy: {ac}\nWeighted Precision: {wp}")
    print(f"Weighted Recall: {wr}\nWeighted F1: {wf}")

In [None]:
#Finds the k-value for which the accuracy was the greatest 
greatest_accuracy_index = accuracy.index(max(accuracy))
print(f"The k-value with the greatest accuracy was {greatest_accuracy_index+1}")

In [None]:
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

Now we will use the plotly library to create plots that show how each metric varies with the k-value. I'm using plotly instead of matplotlib for its readability and interactive functionality.

In [None]:
#Makes a 2x2 set of subplots for each metric
fig = make_subplots(rows = 2, cols = 2,
                    subplot_titles = ("Accuracy", "Weighted Precision", "Weighted Recall", "Weighted F1"),
                   horizontal_spacing = 0.4)

#creates a list of numbers from 1-12 to serve as our x-axis
X = [*range(1,13)]

#plots all the points
fig.add_trace(
    go.Scatter(x=X, y = accuracy, name = "Accuracy Scores"),
    row = 1, col =1
)

fig.add_trace(
    go.Scatter(x=X, y = weighted_precision, name = "Weighted Precision Scores"),
    row = 1, col = 2
)

fig.add_trace(
    go.Scatter(x=X, y = weighted_recall, name = "Weighted Recall Scores"),
    row = 2, col = 1
)

fig.add_trace(
    go.Scatter(x=X, y = weighted_f1, name = "Weighted F1 Scores"),
    row = 2, col = 2
)

#Adds title and adjusts size
fig.update_layout(height = 600, width = 600, title_text = "Performance Report")

#Labels x-axes
fig.update_xaxes(title_text = "K_Values", row = 1, col = 1)
fig.update_xaxes(title_text = "K_Values", row = 1, col = 2)
fig.update_xaxes(title_text = "K_Values", row = 2, col = 1)
fig.update_xaxes(title_text = "K_Values", row = 2, col = 2)

#Labels y-axes
fig.update_yaxes(title_text = "Accuracy of kNN", row = 1, col = 1)
fig.update_yaxes(title_text = "Precision of kNN", row = 1, col = 2)
fig.update_yaxes(title_text = "Recall of kNN", row = 2, col = 1)
fig.update_yaxes(title_text = "F1 of kNN", row = 2, col = 2)

fig.show()



## What I notice

All the metrics except precision seem to increase as the k-value gets higher.

In [None]:
#adds a column to colors_df to have all the predictions from the highest accuracy model side-by-side
colors_df['predictions'] = predictions[greatest_accuracy_index]

In [None]:
colors_df

In [None]:
'''
Uses the plotly library to create a 3D scatter plot where colors are plotted based on their RGB-value 
Then they are labeled based on what the kNN model of the greatest accuracy predicted their values to be. 
'''
fig = px.scatter_3d(colors_df, x = 'R', y = 'G', z = 'B',
                   color = 'colour_name', text = 'predictions')

fig.update_traces(textposition = 'top center')

fig.update_layout(
    height = 600,
    width = 800,
    title_text = "Predicted Colors and their Actual Colors"
)

fig.show()

## What I notice

There are a lot of repeats in the predictions, most likely because so many color names repeat in the dataframe. These frequently repeated names were also the ones predicted most accurately, since they appear so often. The least accurate predictions were for points whose names are near-synonyms of a more common color (e.g. mauve, violet). The program often just assigned the more general name to these colors.

# Task 2: Assessing color names

For task 2 we will need to:
 - Make a set of N=10 colors in RGB format
 - measure the distance between the test colors and predicted names
 - create a weighted graph of color synonyms
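As a minimal sketch of the distance step, assuming that a name's position is the centroid (channel-wise mean) of the RGB values carrying that name, the snippet below uses only the standard library. The sample values are made up so the arithmetic is easy to check by hand.

```python
import math
from statistics import mean


def centroid(points):
    """Channel-wise mean of a list of (R, G, B) tuples."""
    return tuple(mean(channel) for channel in zip(*points))


# Made-up samples that all share one label
red_samples = [(250, 0, 0), (250, 0, 60), (250, 80, 0), (250, 80, 60)]
red_centroid = centroid(red_samples)

# A test color that belongs to the same label
test_colour = (250, 0, 0)
distance_to_centroid = math.dist(test_colour, red_centroid)
print(red_centroid, distance_to_centroid)
```

Even though the test color is one of the label's own samples, its distance to the centroid is not zero, because the centroid summarizes every RGB value that shares the name.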

In [None]:
#Uses the previously created colour_inputs function to create a list of 10 RGB values
alice = colour_inputs(10)

In [None]:
def rgb_centroid(data, value_column , value):
    '''
    Input:
     - Data: Dataframe 
     - value_column: the column to search in
     - value: the color to search for
    
    Output:
     - A centroid for the inputted value 
    
    This function works by first locating all the rgb values that pertain
    to the inputted color_name. Then it averages those rgb values
    to determine a centroid for that color
    '''
    
    #Creates empty variables to store the centroid  RGB values
    R,G,B = 0,0,0
    
    #filters and stores the dataframe to only have the value we are looking for
    rgb_values = data.loc[data[value_column] == value]
    #stores the length of our small dataframe
    length = len(rgb_values)
    
    #a for-loop that goes through our small dataframe and adds together 
    #the R,G,B values respectively
    for index, row in rgb_values.iterrows():
        R += row['R']
        G += row['G']
        B += row['B']
    
    #averages each RGB Value and then returns it in a list
    centr = [R/length, G/length, B/length]
    return np.array(centr)

In [None]:
def find_color(data,value_column,R,G,B):
    '''
    Input: 
     - data: Dataframe
     - value_column: the column to search in
     - R,G,B: the RGB Values
    Output:
     - The color associated with that RGB value
     
     This function is created to allow for an easier way to search 
     our dataframe for a color_label based on inputted rgb values. It
     works by first filtering the dataframe down to the rows with the
     exact same RGB value, then converts the df column into a list and 
     returns the first color value. It assumes the RGB value exists in
     the dataframe; otherwise indexing the empty list raises an IndexError
    '''
    #Filters and stores the dataframe based on RGB
    a = data.loc[(data['R'] == R) & 
                 (data['G'] == G) & 
                 (data['B'] == B)]
    #converts and stores the sorted dataframe column that we're looking for
    b = list(a[value_column])
    
    #returns the color value
    return b[0]

In [None]:
def euclidean(a,b):
    '''
    Input: 
     - a: first vector
     - b: second vector
    Output:
     - euclidean distance between a and b
    Quick function to make it easier to calculate euclidean distance
    '''
    return np.linalg.norm(a-b)

In [None]:
'''
Now we need to calculate the distance between Alice's actual values
and Bob's predicted values
'''

#Empty list to later store the distance values
distance = []

#A for loop to iterate through the list of alice's inputs
for i in alice:
    #First finds the prediction and actual label corresponding to the 
    #rgb values that alice inputted
    predicted_color = find_color(colors_df, 
                                 'predictions', 
                                 i[0], i[1], i[2])
    
    alice_color = find_color(colors_df,
                             'colour_name', 
                             i[0], i[1], i[2])
    
    #Then calculates the centroid of that predicted color
    predicted_centroid = rgb_centroid(colors_df, 
                                      'predictions', 
                                      predicted_color)
    
    #Finally finds the euclidean distance between the array of
    #alice's values and array of the centroid, and also appends it to the distance list
    dist = euclidean(np.array(i),predicted_centroid)
    distance.append(dist)
    
    #A small print statement to show what color Alice put, what Bob predicted
    #and the distance between them
    print(f"\nThe distance between Alice's {alice_color} and Bob's {predicted_color} is {dist}")

In [None]:
#Creates a dataframe to store the euclidean distances and trial number
results_df = pd.DataFrame(distance, columns = ['Euclidean Distance'])
results_df['Color #'] = range(1, len(distance) + 1)

In [None]:
#plots the dataframe with x being the trial number and y being 
#the euclidean distance
fig = px.bar(results_df, x= 'Color #', y= 'Euclidean Distance', 
             title= "Distance between Bob's Predicted Colors and Alice's Actual Colors")
fig.show()

## What I notice

The graph looks mostly symmetric, which is most likely due to the order in which I entered the RGB values. I predicted that the trials where the program guessed the color correctly would have a distance of 0; however (as in trial #10), they often had a large distance. This is because we measure the distance to the centroid of the predicted label, and the same color label appears on many different RGB values.

# Creating a weighted graph
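The idea, sketched here without networkx: every color name is a vertex, and every pair of names is joined by an edge whose weight is the distance between their centroids, so a small weight suggests near-synonyms. The centroid values below are made up, and the dictionary-of-dictionaries layout is just one simple way to hold a weighted undirected graph.

```python
import math
from itertools import combinations

centroids = {  # made-up centroids for three color names
    "red": (250, 10, 10),
    "pink": (250, 150, 170),
    "blue": (10, 10, 250),
}

# graph[u][v] holds the weight of the undirected edge between u and v
graph = {name: {} for name in centroids}
for u, v in combinations(centroids, 2):  # each unordered pair once, no self-loops
    w = math.dist(centroids[u], centroids[v])
    graph[u][v] = w
    graph[v][u] = w

# The lightest edge from a vertex points at its closest color name
closest_to_pink = min(graph["pink"], key=graph["pink"].get)
print(closest_to_pink)
```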

In [None]:
import networkx as nx
import collections as col

In [None]:
class Graph:
    '''
    This class is to allow for an easier use of the networkx library and 
    its functionalities of adding vertices, edges, and drawing graphs.
    '''
    
    def __init__(self, V = (), E = ()):
        
        '''
        this init function reads a list of vertices and a list of edges, 
        it then proceeds to add the vertices and edges to the graph.
        '''
        
        #declares a networkx graph
        self.G = nx.Graph()
        
        #iterates through the vertex list and adds them to the graph
        for v in V:
            self.add_vertex(v)
        
        #iterates through the edge list and adds an edge and weight between
        #the two vertices
        for u, v, w in E:
            self.add_edge(u, v, w)
         
    def add_vertex(self, v):
        '''
        This add_vertex function takes in an input of a vertex then 
        checks if the vertex is in the graph. If not, it proceeds to 
        add a new node of v into the graph. 
        '''
        if v not in self.G:
            self.G.add_node(v)
        
    def add_edge(self, u, v, w):
        '''
        This add_edge function takes in 3 values. u represents the first node,
        v represents the second node, and w represents the weight 
        between the nodes. It works by adding the two vertices if they're not
        already in the graph, then checks if u and v are equal. If they aren't
        then it adds an edge between the two vertices with weight w. 
        The reason why this check is important is so we don't get a vertex
        going to itself.
        '''
        self.add_vertex(u)
        self.add_vertex(v)
        
        if u != v:
            self.G.add_edge(u,v, weight = w)

        
    def visualize(self):
        '''
        This visualize function utilizes the nx.draw() function to create
        a graph of certain specifications. It works by creating a plot of
        20x10 dimensions, then finding the values of the edges and weights,
        then creating a spring_layout to space out the nodes, finally plots
        the nodes and edges with edge_color changing depending on weight.
        '''
        plt.figure(figsize = (20,10))
        
        
        edges,weights = zip(*nx.get_edge_attributes(self.G,
                                                   'weight').items())
        
        pos = nx.spring_layout(self.G, scale = 1.5)
        nx.draw(self.G, pos, edgelist = edges, 
                node_color = "#ffffff",
                edge_color = weights, with_labels = True, 
                node_size = 1500, font_size = 8)

In [None]:
#Creates a list of all unique color names
vertices = list(set(colors_df['colour_name']))

In [None]:
#Creates a list of all the edges and weights 
edges = [] 
#finds the centroid for each color once, so it is not recomputed inside the loops
centroids = {v: rgb_centroid(colors_df, 'colour_name', v) for v in vertices}
#loops through the list of vertices
for i in vertices:
    #loops through list again to get the second vertex
    for j in vertices:
        #Adds to the list of all the edges and weights
        edges.append((i,j,euclidean(centroids[i],centroids[j])))

In [None]:
#creates a Graph G
G = Graph(vertices, edges)

In [None]:
G.visualize()

# Reflection on AE2

AE2: The Colour Language Game proved to be a challenging yet rewarding task, one that forced me to build new knowledge and grow as a future data professional. The first challenge came in the form of the instructions. As I am still very new to this form of machine learning, the instructions were unfamiliar at first, and many times I was left puzzled as to how I should even approach a certain task. I combatted this by reading the instructions line by line and writing out my thoughts with pencil and paper. I detailed what I thought each task meant and how to implement the functionality needed to complete it. For example, to remove the duplicate RGB values and find the most common color_label, I researched the pandas library and found the value_counts() and idxmax() functions, which solved this task. This process of writing and researching before typing out code was beneficial, as it allowed me to stay organized.

Another challenge was creating graphs. When I originally wrote this program, I used matplotlib for all the scatterplots, bar graphs, etc. However, I found that the graphs often remained cluttered, and organizing and labeling each point only cluttered them further. Determined to make my project appear as neat as possible, I found the plotly library, which is now one of my favorite libraries to use in Python. It lets me easily create interactive graphs that would otherwise be very difficult for someone at my skill level to create. This process also taught me the importance of reading library documentation, as that is how I learned to implement subplots.

Beyond the challenges, I believe my techniques were in general very effective in completing the task at hand. For example, in training and generating a performance report, I created a function to not only train the model but also calculate all the necessary metrics. However, one area that can be improved is the actual accuracy of the model, which hovers around a very low 40%. I believe this comes from the dataset having color names that often repeat. In the future, I hope to improve this accuracy either by adding more data points or by further sorting my data so that each color name has exactly one RGB representation. Furthermore, I hope to improve my code by using more classes and object-oriented design principles to allow for a more streamlined process of analyzing the data, as I did in the graphs section, where I used a class to make it easier to create the graph. In general, I am proud of myself for completing this project and learning several new techniques that will push me to new heights as I continue on my Data Science journey.

# Extra Functionality

For extra functionality I wanted to add pie charts that show how often each color appears in the list of predictions and in the list of actual color labels. The data is heavily skewed toward a few labels, and I wanted to find which colors most affected it.
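The quantity each pie slice represents is simply a label's share of the list. A minimal sketch of that calculation on made-up labels, using only the standard library (the function name `label_shares` is my own):

```python
from collections import Counter


def label_shares(labels):
    """Fraction of the list taken up by each label, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}


actual_labels = ["pink", "pink", "purple", "mauve"]      # made-up labels
predicted_labels = ["pink", "pink", "pink", "purple"]
print(label_shares(actual_labels))
print(label_shares(predicted_labels))
```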

In [None]:
'''
To create the pie charts I will use the plotly library and create 2 subplots of side-by-side charts. One chart will
represent the color distribution in the actual color labels with the other being in the predictions. This program 
works by first setting up the subplot of (1x2), then creating 2 pie charts, making them neater, and labeling the 
title
'''
#Creates the subplots
fig = make_subplots(rows = 1, cols =2, specs = [[{'type' : 'domain'},
                                                {'type':'domain'}]])

fig.add_trace(go.Pie(labels = colors_df['colour_name'], name = 'colour_name', title = "Actual Color Labels"), 
             1, 1)

fig.add_trace(go.Pie(labels = colors_df['predictions'], name = 'predictions', title ="Predictions"), 
             1, 2)

#Places the text inside the slices and hides any text that would need a font smaller than 12
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')

#titles the graph
fig.update(layout_title_text = 'Color Distributions between Actual Labels and Predictions')
fig.show()

## What I noticed

After this analysis I realize the predictions almost directly mirror the distribution of the actual color labels. For example, in both pie charts the "pink" and "purple" color labels appeared the most, while more niche colors appeared less. I strongly believe this unevenness in the data is what makes my kNN model less accurate. To fix this I would add more data points to the smaller categories to hopefully even out the dataset.