## Week 2-2 - Cluster analysis

Your mission is to analyze a data set of social attitudes by turning it into vectors, then visualizing the result.

### 1. Choose a topic and get your data

We're going to be working with data from the General Social Survey, which asks Americans thousands of questions every year, over decades. This is an enormous data set, and many stories have been written from its data. The first thing you need to do is decide which questions and which years you are going to try to analyze.

Use their [data explorer](https://gssdataexplorer.norc.org/) to see what's available, and ultimately download an Excel file with the data.







In [7]:
import pandas as pd
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

In [1]:
# load your data set here

### 3. Turn people into vectors
I know, it sounds cruel. We're trying to group people, but computers can only group vectors, so there we are. 

Translating the spreadsheet you downloaded from GSS Explorer into vectors is a multistep process. Generally, each row of the spreadsheet is one person, and each column is one question.

- First, we need to throw away any extra rows and columns: headers, questions with no data, etc.
- Many GSS questions already have numerical answers. These usually don't require any work.
- But you'll need to turn categorical variables into numbers. Here's [how to turn categorical variables into numbers](http://pbpython.com/categorical-encoding.html) in Pandas.
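
The first two steps might look something like the sketch below. The column names (`age`, `polviews`, `unused`) and the tiny data frame are made-up stand-ins for a GSS export, so adapt them to whatever your spreadsheet actually contains.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a GSS export: one row per person, one column per question.
raw = pd.DataFrame({
    "age": ["34", "51", None, "27"],
    "polviews": ["liberal", "conservative", None, "moderate"],
    "unused": [np.nan, np.nan, np.nan, np.nan],  # a question with no data
})

# Throw away columns, then rows, that are completely empty.
clean = raw.dropna(axis=1, how="all").dropna(axis=0, how="all").copy()

# Numeric answers sometimes arrive as text. errors="coerce" turns anything
# that can't be parsed as a number into NaN instead of raising an error.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")
```

Here `dropna(how="all")` only removes rows or columns where every value is missing, so people who skipped a single question are kept.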


The easiest way to turn categories into numbers is like this:

In [11]:
df = pd.DataFrame({'numbers':[100,50,200,10,150], 'animal':['cat','frog','cat','moose','frog']})
df

|   | animal | numbers |
|---|--------|---------|
| 0 | cat    | 100     |
| 1 | frog   | 50      |
| 2 | cat    | 200     |
| 3 | moose  | 10      |
| 4 | frog   | 150     |


In [13]:
df['animal'] = df['animal'].astype('category')
df # now it looks the same, but it's stored internally as category codes

|   | animal | numbers |
|---|--------|---------|
| 0 | cat    | 100     |
| 1 | frog   | 50      |
| 2 | cat    | 200     |
| 3 | moose  | 10      |
| 4 | frog   | 150     |


In [14]:
# add a new column with the numeric representation of the categorical column
df['animal-code'] = df['animal'].cat.codes
df 

|   | animal | numbers | animal-code |
|---|--------|---------|-------------|
| 0 | cat    | 100     | 0           |
| 1 | frog   | 50      | 1           |
| 2 | cat    | 200     | 0           |
| 3 | moose  | 10      | 2           |
| 4 | frog   | 150     | 1           |




When you're done preparing your data, your data frame of vectors should have one row per person and one column per question, and only numeric values. Everything else, including the original categorical answers and any other information, has to go. Actually, don't throw it away: save it in a separate dataframe, because we'll use it later for interpretation.
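
One possible way to end up with both data frames is sketched here. The column names (`age`, `region`, `owns_gun`) are placeholders rather than real GSS variable names, and the loop simply applies the category-code trick from above to every non-numeric column.

```python
import pandas as pd

# Hypothetical cleaned survey data; the column names are placeholders.
people = pd.DataFrame({
    "age": [34, 51, 27, 45],
    "region": ["south", "west", "south", "northeast"],
    "owns_gun": ["yes", "no", "no", "yes"],
})

# Keep the human-readable answers for interpreting the plot later.
original = people.copy()

# Build the numeric-only version: replace each non-numeric column
# with its category codes.
vectors = people.copy()
for col in vectors.columns:
    if not pd.api.types.is_numeric_dtype(vectors[col]):
        vectors[col] = vectors[col].astype("category").cat.codes
```

Two things to watch for: `cat.codes` assigns -1 to missing answers, and the codes follow the sorted order of the category labels, which may not be a meaningful order for the question.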

In [15]:
# Turn your dataframe into feature vectors here

### 4. Plot those vectors!
For this assignment, we'll use the PCA projection algorithm to make 2D (or 3D!) pictures of the set of vectors. Once you have the vectors, it should be easy to make a PCA plot using the steps we followed in class.
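
As a reminder of the general shape of a PCA plot, here is a minimal sketch. It uses random numbers in place of your survey vectors (the `vectors` array is an assumption for illustration) and may differ in detail from the steps we followed in class.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in data: 200 "people" answering 6 numeric "questions".
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 6))

# Project the 6-dimensional vectors down to 2 dimensions.
pca = PCA(n_components=2)
coords = pca.fit_transform(vectors)

# One dot per person, positioned by the first two principal components.
fig, ax = plt.subplots()
ax.scatter(coords[:, 0], coords[:, 1], s=10)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
```

With your own data, pass your numeric data frame to `fit_transform` instead of the random array. For a 3D picture, ask for `n_components=3`.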
    

In [None]:
# make a PCA plot here

### 5. Add color to help interpretation
Congratulations, you have a picture of a blob of dots. Hopefully, that blob has some structure representing clusters of similar people. To understand what the plot is telling us, it really helps to take one of the original variables and use it to assign colors to the points. 

So: pick one of the questions that you think will separate people into natural groups. Use it to set the color of the dots in your scatterplot. By repeating this with different questions, or combining questions (like two binary questions giving rise to a four-color scheme), you should be able to figure out what the structure of the clusters represents.
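
Here is a rough sketch of the four-color idea. The arrays `coords`, `q1`, and `q2` are random stand-ins for your PCA output and two yes/no questions coded as 0 and 1, so treat all of the names as hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
coords = rng.normal(size=(100, 2))   # stand-in for the 2D PCA coordinates
q1 = rng.integers(0, 2, size=100)    # first binary question (0 or 1)
q2 = rng.integers(0, 2, size=100)    # second binary question (0 or 1)

# Combine the two answers into one group number from 0 to 3.
group = 2 * q1 + q2

# Color each dot by its group.
fig, ax = plt.subplots()
points = ax.scatter(coords[:, 0], coords[:, 1], c=group, cmap="viridis", s=12)
fig.colorbar(points, ax=ax, ticks=[0, 1, 2, 3], label="2*q1 + q2")
```

To color by a single question, pass that question's numeric column as `c` instead of `group`.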


### 6. Tell us what it means
What did you learn from this exercise? Did you find the standard left-right divide? Or urban-rural? Early adopters vs. luddites? People with vs. without children? 

What could end up in a story?


In [16]:
# What does it all mean?