# **HunterCS Topics: Clustering Code-Along**

In this Colab, you will learn some basic commands from the numpy and matplotlib packages. Then, you will use those functions to write the k-means clustering algorithm and use it to cluster and visualize a 1-dimensional set of data.

# (1) Generating Random Data with numpy
When trying to write an algorithm, we want to make sure we know what the outcome will be. We'll generate data that has obvious clusters so that we can check that our clustering algorithm works properly.

**numpy** Functions that Generate Random Numbers:
* [np.random.randint](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html#numpy.random.randint) - generates a random int within a range (the upper bound is excluded)
* [np.random.uniform](https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html) - generates a random float within a range, with every value equally likely
* [np.random.normal](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html#numpy.random.normal) - generates a random float from a normal distribution
* [np.random.choice](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html#numpy-random-choice) - chooses a random element from an array
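
As a rough illustration of the four functions listed above, the sketch below calls each one once. The seed value and the variable names (`one_int`, `ten_ints`, `floats`, `n1`, `pick`) are assumptions made for this example, as is reading "from 0 to 100" as including 100.

```python
import numpy as np

# Assumption: fix the seed so the "random" numbers repeat on every run
np.random.seed(0)

# randint excludes the upper bound, so 101 allows values 0 through 100
one_int = np.random.randint(0, 101)
ten_ints = np.random.randint(0, 101, size=10)

# 25 floats spread evenly between 0.0 and 1.0
floats = np.random.uniform(0.0, 1.0, size=25)

# 25 floats from a normal distribution: center (loc) 4, spread (scale) 1.5
n1 = np.random.normal(loc=4, scale=1.5, size=25)

# Pick one element at random from an existing array
pick = np.random.choice(ten_ints)

print(one_int, ten_ints, pick)
```

Removing the `np.random.seed` line gives different numbers on every run, which is closer to how the exercise cells below behave.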



In [None]:
# Import numpy


In [None]:
# Generate 1 random int from 0 to 100


In [None]:
# Generate an array of 10 random ints from 0 to 100


In [None]:
# Generate an array of 25 random floats from a uniform distribution


In [None]:
# Generate an array of random floats from a normal dist
# located with a center at 4, scale of 1.5 and size of 25


# (2) Plotting Data with matplotlib

**Matplotlib** is a very versatile plotting library for Python (its plotting commands are modeled on MATLAB's). Here we will use it for data visualization, so that we can see the clustering in the data. We will be using the pyplot module of the matplotlib library.

Useful **matplotlib.pyplot** functions:
* [plt.hist](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) - plots a histogram
* [plt.scatter](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html#matplotlib.pyplot.scatter) - plots a scatterplot
* [plt.show](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html) - used to show a plot after using multiple plot commands
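
One possible way to use the three functions above is sketched here. The sample array `n1`, the seed, and the choice of `bins=10` are assumptions for illustration; any 1D array would work the same way.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumption: sample data similar to the normal-distribution array from Section 1
np.random.seed(0)
n1 = np.random.normal(loc=4, scale=1.5, size=25)

# Histogram: how many points fall into each of 10 bins
counts, bin_edges, _ = plt.hist(n1, bins=10)
plt.show()

# Scatterplot of 1D data: use zeros as the y-coordinates
ydata = np.zeros(25)
points = plt.scatter(n1, ydata)
plt.show()
```

Because the data is 1-dimensional, every point in the scatterplot sits on the line y = 0, and clusters show up as dense stretches along the x-axis.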

In [None]:
#import matplotlib


In [None]:
#Create a graph that shows frequency of points (aka histogram)


In [None]:
# Generate an array of zeros with a length of 25 to be the y-coordinates for our scatterplot


# (3) Getting Data
Most data science is done by importing data from a .csv (Comma Separated Value) file. Because our focus today is clustering, we will provide some simpler options:
1. Generate random data to meet our desired criteria
2. Provide data as a numpy array variable from a list
3. Import data from a .csv file

## Option 3A: Getting Data from Random Numbers

To make sure our algorithm works, we need data that will cluster predictably. We already generated `n1`, an array of 25 random floating-point numbers centered at 4; we'll create more arrays centered at other values the same way so that we have distinct clusters. Then we'll combine the arrays into a single dataset.

Useful **numpy** functions:

* [np.concatenate](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html#numpy-concatenate) - combines multiple arrays into single array
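
A minimal sketch of combining three clusters into one dataset, using the centers named in the cells below (4, 10, and 16). The names `n2`, `n3`, and the seed are assumptions for this example.

```python
import numpy as np

np.random.seed(0)  # assumption: seeded only so the example repeats

# Three groups of 25 points with the same spread but different centers
n1 = np.random.normal(loc=4, scale=1.5, size=25)
n2 = np.random.normal(loc=10, scale=1.5, size=25)
n3 = np.random.normal(loc=16, scale=1.5, size=25)

# concatenate takes ONE argument: a list (or tuple) of arrays
data = np.concatenate([n1, n2, n3])

# The y-coordinates must match the new, longer data array
ydata = np.zeros(len(data))

print(data.shape)
```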


In [None]:
# Generate a second set of random numbers with a center at 10, scale of 1.5 and size of 25


In [None]:
# Generate a third set of random numbers with a center at 16, scale of 1.5 and size of 25


In [None]:
# Plot all 3 sets of random numbers
# Note: Use the array of zeros we generated previously, ydata, for the y-axis plotting




In [None]:
# [OPTIONAL] Create a histogram to plot the 3 sets of numbers
# Use the bins parameter to determine how "skinny" the bars should be




In [None]:
# Concatenate the 3 sets of random numbers into 1 larger data array
# Note: the concatenate() function takes a single list of arrays, written inside []


In [None]:
# Create a scatterplot of the new set of data
# Note: Update the ydata array of zeros to match the size of the bigger data array



In [None]:
# [OPTIONAL] Create a histogram to show the data
# Use the bins parameter to determine how "skinny" the bars should be
# plt.hist(data, bins=20)


## Option 3B: Getting Data from a List
Our class completed a survey about:
* horror movie preferences
* favorite number
* preferred number of pets
* how much to pay for shoes
* height in inches

This data was copied into a **numpy** array that can be checked for clustering.

In [None]:
# # Access arrays of data from our class survey
# heights=np.array([65,72,65,60.5,70,77,60,65,67,67,65,60,73.5,60.25,63,68,69,62.5,63,63,70,69,57,68,64,70,67,72,66,72,68,67,63])
# shoes=np.array([180,65,150,200,200,150,200,400,100,200,500,200,350,300,500,80,250,80,200,240,175,350,200,150,150,60,200,200,100,300,75,140,200])
# horror=np.array([2.1,1.1,2.5,2.6,1.2,2.2,1.2,5.8,7.1,5.1,7.2,1.6,4.7,3.3,1.1,4.7,5.0,5.2,4.3,3.3,2.8,4.1,7.8,1.1,1.8,1.6,5.4,3.2,1.0,7.8,5.6,7.0,2.9])
# fave_nums=np.array([7,7,2,8,7,9,4,7,7,3,10,1,3,9,2,4,7,6,10,7,7,7,10,7,2,8,4,8,9,10,7,1,5])
# pets=np.array([4,0,2,3,0,2,1,2,3,2,0,1,1,3,0,4,0,0,1,3,1,4,1,0,2,2,0,0,2,5,2,1,0])
# heights

In [None]:
# Assign arrays to be used for clustering (as data)
# data = heights
# data = shoes
# data = horror
# data=fave_nums       #not good for clustering with 1D data
# data=pets            #not good for clustering with 1D data
# data

In [None]:
# Setup the ydata value
# Note: if 1D data, make an array of zeros the same length as the xdata
#ydata = heights
# ydata = np.zeros(len(data))
# ydata

In [None]:
# Plot the new set of data
# plt.scatter(data, ydata)

## Option 3C: Getting Data from a .CSV file
**pandas** is a Python module widely used for handling large data sets like those you might get from a .csv file. A pandas "DataFrame" is the table-like structure used to hold and manipulate the data.

Useful **pandas** functions:

* [pd.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) - import a csv file's data
* [df.to_numpy](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html) - convert a pandas DataFrame (or one of its columns) to a numpy array

Notes:
1. *Uncomment code segments to run*
2. *Make sure to import a .csv file into the project, by placing it in the 'content' folder of the Colab*
3. *You can find many real-world .csv datasets at [Kaggle.com](https://www.kaggle.com/datasets?fileType=csv)*

In [None]:
# Import the pandas module
# import pandas as pd

In [None]:
# Import data from a csv file as a "dataframe"
# Note: Make sure you have a .csv file in your project folder (in Colab, check the "content" folder)
# df = pd.read_csv('sample_data/california_housing_test.csv')
# df

In [None]:
# Examine one particular column of the data frame 
#df.population
#df.total_rooms
#df.median_income
#df.median_house_value

In [None]:
# Convert one of the columns from the dataframe to a numpy array
# data = df.median_house_value.to_numpy()
# data

In [None]:
# Generate an array to use for the y-values (either zeros or actual data)
# ydata = [0]*len(data)
# ydata = df.population.to_numpy()  #option for 2D clustering

In [None]:
# Plot the new set of data
# Note: With a very large data set, 1D data will likely look like a straight line
# but clusters may be much more visible with 2D data
# plt.scatter(data, ydata)

# (4) Setting up the k-means Clustering Algorithm
Now that we have data to work with, we should begin setting up the structure for our k-means clustering algorithm.



Important design decisions include: 
* `k` - How many clusters do we want to sort the data into?
* `centroids` - What is the value at the center of all the data points of a particular cluster?  How will we decide on the initial centroid values of those clusters?
* `labels` - How will we determine which cluster each data point will be assigned to?

Useful numpy functions:
* [np.random.choice](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html#numpy-random-choice) - chooses a random number from an array
* [np.random.uniform](https://numpy.org/doc/stable/reference/random/generated/numpy.random.uniform.html) - generates a random float across a range
* [np.min](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.min.html) - selects the minimum value from a numpy array
* [np.max](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.max.html) - selects the maximum value from a numpy array
* [np.zeros](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html#numpy.zeros) - generates an array of zeros
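
The three design decisions above could be set up roughly as follows. This sketch assumes the data is the three-cluster random dataset from Option 3A and that the starting centroids are picked from the data points themselves; both are choices made for illustration, not the only option.

```python
import numpy as np

np.random.seed(1)  # assumption: seeded only so the example repeats

# Assumption: the same kind of data as Option 3A
data = np.concatenate([
    np.random.normal(loc=4, scale=1.5, size=25),
    np.random.normal(loc=10, scale=1.5, size=25),
    np.random.normal(loc=16, scale=1.5, size=25),
])

# k: how many clusters we expect
k = 3

# centroids: pick k different data points as starting guesses
# replace=False means the same position in data is never picked twice
centroids = np.random.choice(data, size=k, replace=False)

# labels: one integer per data point, all starting in cluster 0
labels = np.zeros(len(data), dtype=np.int8)

print(centroids)
```

Another reasonable starting point is `np.random.uniform(np.min(data), np.max(data), size=k)`, which picks random values anywhere inside the data's range instead of actual data points.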

In [None]:
# Define the number of clusters (k) you expect in the data


In [None]:
# Choose k random points (3 here) to be your starting values for the centroids
# Note: Make sure no 2 centroids are the same value


In [None]:
# Create an array of zeros that will store the labels for which cluster each data point should belong to
# (ie. Cluster label 0, Cluster label 1, and Cluster label 2, etc)
# Note: We need these to be integers and not floats, so use the dtype=np.int8 parameter



# (5) Updating clusters (centroids & labels)
To continue the k-means clustering algorithm, we now need to assign every data point to one of our clusters, depending on which cluster's centroid is closest. Then we move each centroid to the mean of the data points assigned to it.

Useful numpy functions:
* [np.zeros](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html#numpy.zeros) - generates an array of zeros
* [np.argmin](https://numpy.org/doc/stable/reference/generated/numpy.argmin.html) - returns the index of the minimum value of an array
* [np.mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html#numpy.mean) - finds the mean of an array
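
A single assign-then-update pass might look like the sketch below. The tiny `data` array and the starting `centroids` are hypothetical values chosen so the result is easy to check by hand.

```python
import numpy as np

# Hypothetical data with three obvious groups, and hypothetical starting centroids
data = np.array([3.0, 4.0, 5.0, 9.0, 10.0, 11.0, 15.0, 16.0, 17.0])
centroids = np.array([4.5, 9.5, 14.0])
k = len(centroids)

# Step 1: distance from every data point (rows) to every centroid (columns)
distances = np.zeros((len(data), k))
for i in range(len(data)):
    for j in range(k):
        distances[i, j] = abs(data[i] - centroids[j])

# Step 2: each point gets the label of its closest centroid
labels = np.argmin(distances, axis=1)

# Step 3: move each centroid to the mean of the points assigned to it
for j in range(k):
    if np.any(labels == j):  # skip clusters that received no points
        centroids[j] = np.mean(data[labels == j])

# For 1D data, the centroids sit on y = 0 like the data points
ycentroids = np.zeros(k)

print(labels)
print(centroids)
```

The `np.any` check matters because `np.mean` of an empty selection returns `nan`, which would break later distance calculations.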

## Option 5A: Updating Clusters Once
Write segments of code that will create a first guess at the best way to cluster the data points.

In [None]:
# For every data-point, calculate the distance to each of the k centroids
# Store the distances in a 2D array with one row per data-point and k columns (one per centroid)





In [None]:
# Assign each data-point to a cluster (whichever centroid is closest)




In [None]:
# Update the centroid values to better match the assigned data-points
# Note: You can filter out a subset of the array with [labels == i]




In [None]:
# Generate ycentroids
# Note: if 1D data, make an array of zeros the same length as the xcentroids




## Option 5B: Updating Clusters in a Loop (n Times)
Write a code block that will repeat all of the code blocks from Section 5A a constant number of times, `n`.
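
One way to sketch this loop is shown here, with the distance step written as a single numpy expression instead of nested loops. The data, the starting centroids, and `n = 5` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical data and deliberately poor starting centroids
data = np.array([3.0, 4.0, 5.0, 9.0, 10.0, 11.0, 15.0, 16.0, 17.0])
centroids = np.array([3.0, 9.0, 12.0])
k = len(centroids)
n = 5  # number of times to repeat the update

for iteration in range(n):
    # data[:, None] is a column and centroids[None, :] is a row, so the
    # subtraction gives a (number of points) x k table of distances
    distances = np.abs(data[:, None] - centroids[None, :])

    # Assign each point to its closest centroid
    labels = np.argmin(distances, axis=1)

    # Move each centroid to the mean of its points
    for j in range(k):
        if np.any(labels == j):
            centroids[j] = np.mean(data[labels == j])

    # Print the centroids on every pass to watch them settle
    print(iteration, centroids)
```

A fixed `n` is simple, but it may run more passes than needed or stop before the clusters have settled, which is the motivation for checking convergence instead.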

In [None]:
# Copy all of Updating Clusters code (5A) into a loop
# Setup the loop to exit after n times

  #print out the centroids each iteration of the loop
  

## Option 5C: Updating Clusters Until Convergence
Write a code block that will repeat all of the code blocks from Section 5A until the clusters stop changing, or "converge". You can check for convergence by testing whether either the labels array or the centroids have stopped changing.



Additional features within your loop should:
* `# Store the labels/centroids from each previous iteration`
* `# Check if the old labels/centroids are equivalent to the new labels/centroids`
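
The two features above can be sketched as follows. This version compares old and new centroids; comparing labels would also work. The data, the starting centroids, and the safety cap `max_iters` are hypothetical values for this example.

```python
import numpy as np

# Hypothetical data and starting centroids
data = np.array([3.0, 4.0, 5.0, 9.0, 10.0, 11.0, 15.0, 16.0, 17.0])
centroids = np.array([3.0, 9.0, 12.0])
k = len(centroids)
max_iters = 100  # safety cap so the loop always ends

for iteration in range(1, max_iters + 1):
    # Store the centroids from the previous iteration
    old_centroids = centroids.copy()

    # Assign each point to its closest centroid
    distances = np.abs(data[:, None] - centroids[None, :])
    labels = np.argmin(distances, axis=1)

    # Move each centroid to the mean of its points
    for j in range(k):
        if np.any(labels == j):
            centroids[j] = np.mean(data[labels == j])

    # Converged: the update did not move any centroid
    if np.allclose(old_centroids, centroids):
        break

print("stopped after", iteration, "iterations:", centroids)
```

The `.copy()` is needed because `old_centroids = centroids` would make both names refer to the same array, so the comparison would always report "no change".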

In [None]:
# Copy all of Updating Clusters code (5A) into a loop
# Setup the loop to exit after convergence
# [OPTIONAL] Setup the loop to exit after convergence OR end early after n iterations



# (6) Plotting the Clusters
Use the matplotlib functions to display a graph showing the data points colored by cluster with clear markers indicating where the centroids are for each cluster.

Useful matplotlib functions:
* [plt.scatter](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html#matplotlib.pyplot.scatter) - plots a scatterplot
  * `x` parameter to specify data on horizontal axis
  * `y` parameter to specify data on vertical axis
  * `s` parameter to specify size of the plot marker
  * `c` parameter to specify a list of values used to color each point
  * `marker` parameter to choose a [shape/character](https://matplotlib.org/stable/api/markers_api.html) to mark points
  * `linewidths` parameter to specify how thick the marker's lines should be
  * `color` parameter to specify one specific color
* [plt.xlabel](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xlabel.html) - adds label to x-axis
* [plt.ylabel](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylabel.html) - adds label to y-axis
* [plt.title](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.title.html) - adds title to plot
* [plt.show](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.show.html) - used to show a plot after using multiple plot commands
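
Putting these functions together might look like the sketch below. The `data`, `labels`, and `centroids` values are hypothetical stand-ins for the results of Section 5, and the marker size, line width, and axis text are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical clustering results standing in for the output of Section 5
data = np.array([3.0, 4.0, 5.0, 9.0, 10.0, 11.0, 15.0, 16.0, 17.0])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
centroids = np.array([4.0, 10.0, 16.0])

# 1D data: everything is drawn on the line y = 0
ydata = np.zeros(len(data))
ycentroids = np.zeros(len(centroids))

# Data points, colored by cluster label
points = plt.scatter(x=data, y=ydata, c=labels, s=40)

# Centroids as large red plus-signs drawn on the same plot
marks = plt.scatter(x=centroids, y=ycentroids, marker="+",
                    color="red", s=200, linewidths=3)

plt.xlabel("data value")
plt.ylabel("(unused for 1D data)")
plt.title("k-means clusters and centroids")
plt.show()
```

Both `plt.scatter` calls come before `plt.show()`, which is what places the points and the centroids on the same figure.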

In [None]:
# Create a scatterplot with all the datapoints
# Note: Use c=labels parameter to color the dots based on their cluster


In [None]:
# Create a scatterplot that shows the centroids as red plus-sign (+) markers
# Note: Look in the documentation for the optional parameters to add


In [None]:
# Plot both the datapoints AND the red plus-sign (+) centroids on the same plot
# Note: Copy the lines of code from the previous block 


#display the scatterplot
