In [None]:
from IPython.display import HTML
from datascience import *

import matplotlib
matplotlib.use('Agg')
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')

In [None]:
def css_styling():
    # Read the shared stylesheet and wrap it so Jupyter renders it as HTML
    with open('../notebook_styles.css', 'r') as f:
        styles = f.read()
    return HTML(styles)
css_styling()

### Demography 180 - Social Networks

# Lab 1: Personal networks

Welcome to the first lab for Demography 180 - Social Networks!

In this lab, we will start to analyze the data that we collected as part of the first homework assignment. We will be exploring the structure of Berkeley students' personal networks.

## 1. Introduction

As we discussed last class, there are many different ways to study social networks. One is to study *personal networks*. Personal network studies focus on a particular individual's network members. This approach can be used to study people's sources of social support. Social support is thought to be an important factor in a wide range of outcomes, such as mental and physical health, employment, and education.

Here is an example of the personal network formed by someone's Facebook friends:

![Facebook personal network example](figures/backstrom_kleinberg_2013_fb_ego_example.png)

<div class='imagesource'>
Backstrom and Kleinberg (2013) - [Romantic partnerships and the dispersion of social ties.](http://dl.acm.org/citation.cfm?id=2531642)
</div>


Conceptually, we can think of someone's personal network as looking like this:

![A personal network](figures/ego_network.png)

The focal individual in a personal network study is sometimes called *ego* and the other people who are socially connected to ego are sometimes called *alters*.
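
One way to make the ego/alter vocabulary concrete is to write a tiny personal network down as data. The sketch below is only an illustration: the names and ties are made up, and the dictionary layout (`ego`, `alters`, `alter_ties`) is an assumption for this example, not the format of our survey data.

```python
# A made-up personal network: one ego, the alters ego named,
# and any ties among the alters themselves.
personal_network = {
    "ego": "Alex",
    "alters": ["Blake", "Casey", "Devon"],
    "alter_ties": [("Blake", "Casey")],  # Blake and Casey know each other
}

# Ego is connected to every alter by definition.
ego_ties = [(personal_network["ego"], alter) for alter in personal_network["alters"]]

# Here, network size is the number of alters ego named.
network_size = len(personal_network["alters"])

print(ego_ties)
print(network_size)
```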

One of the most common ways researchers study personal networks is to use surveys. Last week, we talked about a specific survey question:

![GSS important matters question](figures/gss_2004_important_matters.png)

**Question** What is the network that this question is asking about? What are the nodes, and what are the edges?

<div class='response'>
[Answer here]
</div>

For the rest of this lab, when we refer to 'personal networks,' we'll mean the network that this survey question is asking about.

## 2. Survey of Berkeley students

Now we'll turn to the data that you all collected as part of your first homework assignment.
Our goal will be to use our data to better understand the personal networks of Berkeley students.

**Question** Write your name here

<div class='response'>
[Answer here]
</div>

**Question** Write your partner's name here

<div class='response'>
[Answer here]
</div>

**Question** Write an interesting fact about your partner here

<div class='response'>
[Answer here]
</div>

### 2.1 Opening the data

We'll begin by loading the datascience package.

In [None]:
from datascience import *
import pandas as pd

This was actually already done for us in the first code cell, at the top of this notebook. But it's important to remember to do this so that we can use the code in the datascience package.

The dataset is available at the path `../data/survey/berkeley_survey_clean.csv`.

**Question** Write code that loads the dataset as a table called `survey`.

In [None]:
url = "../data/survey/berkeley_survey_clean.csv"
survey = Table.read_table(url)
survey

### 2.2 Exploring the data

Before we start to perform any analysis on a new dataset, it is important to look at the data and be sure that we understand how it is structured. In most cases, there will be documentation that comes with the survey dataset to explain how to analyze it. In our case, we already have a basic understanding of how the data are structured because we all collected and entered it ourselves.

#### 2.2.1 Age of respondents

Let's start by getting a sense of who responded to the survey. We'll look at the age and class year of our respondents.

**Question** How many people responded to the survey?

In [None]:
num_respondents = survey.num_rows
num_respondents

To get a sense of what the data look like, print the first several rows.

In [None]:
survey

**Question** What do you expect will be the age range of people who responded to the survey? Why?

<div class='response'>
[Answer here]
</div>

Now let's investigate the actual data.

**Question** What are the highest and lowest ages of the people who responded to the survey?

In [None]:
survey.column('respondent_age').max()

In [None]:
survey.column('respondent_age').min()

**Question** What was the average age of a survey respondent?

In [None]:
# Add Code

**Question** Draw a histogram of the ages of respondents.

In [None]:
survey.select('respondent_age').hist()

**Question** By default, histograms can look a bit wonky if we haven't specified the bins to use. Draw the histogram again, this time using bins that are one year of age wide, starting at age 17 and ending at age 28.


In [None]:
# Add code here
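
To see how bin edges work, here is a small sketch using NumPy directly on made-up ages (not our survey data). It assumes that one-year-wide bins from 17 to 28 correspond to the edges 17, 18, ..., 28. Note that `np.arange` leaves out its stop value, so the stop has to be 29.

```python
import numpy as np

# Made-up ages, for illustration only
ages = np.array([18, 19, 19, 20, 21, 22, 22, 25])

# Edges 17, 18, ..., 28: np.arange excludes the stop value, so we stop at 29
bin_edges = np.arange(17, 29, 1)

# Each bin includes its left edge and excludes its right edge,
# except the last bin, which includes both of its edges
counts, edges = np.histogram(ages, bins=bin_edges)

for left, count in zip(edges[:-1], counts):
    print(left, count)
```

An array of edges like `bin_edges` is the kind of value that can be passed as the `bins` argument when drawing a histogram.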

**Question** About what percentage of respondents are under age 20?

<div class='response'>
[Answer here]
</div>

#### 2.2.2 Class year of respondents

OK, we have a pretty good handle on the age of people who responded to the survey. Now let's look at their class years.

**Question** About what proportion of respondents would you expect to be Freshmen? Seniors? Why?

<div class='response'>
[Answer here]
</div>

**Question** Now calculate the actual proportion of freshman and senior respondents. <br>
*[Hint: you may find the `where` method useful.]*

In [None]:
num_fresh = survey.where('respondent_class', are.equal_to('Freshman')).num_rows
num_fresh / num_respondents

In [None]:
# Calculate proportion of seniors here

**Question** Are you surprised by the results? If so, can you think of any possible explanations for what you see?

<div class='response'>
[Answer here]
</div>

There is a faster way to count the number of different values of a categorical variable using the `group` method. You haven't talked about this in class yet, but you will. 

**Question** See if you can figure out how `group` works, and then use it to produce a count of the class of our respondents.<br>
*[Hint: you can look at the documentation by running `Table.group?`.]*

In [None]:
Table.group?

In [None]:
survey.group('respondent_class')
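
Conceptually, grouping by a categorical column just counts how many rows share each distinct value. The sketch below shows that idea in plain Python with made-up class years. It illustrates the concept and is not how the datascience package implements `group`.

```python
from collections import Counter

# Made-up class years, for illustration only
class_years = ["Freshman", "Senior", "Junior", "Senior",
               "Sophomore", "Senior", "Freshman"]

# Count how many times each distinct value appears
counts = Counter(class_years)

# Turn the counts into proportions of all respondents
proportions = {year: n / len(class_years) for year, n in counts.items()}

for year in sorted(counts):
    print(year, counts[year], round(proportions[year], 2))
```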

**Question** See if you can use the `group` and `barh` methods to create a bar chart with the counts of respondents by class.

In [None]:
tmp = survey.group('respondent_class').sort('count', descending=True)
tmp.barh('respondent_class')

#### 2.2.3 Additional exploration

OK, now it's your turn! Explore one more thing about the survey respondents (we're going to hold off on analyzing the alters for the time being). You can pick anything that you are curious about, but some possibilities include:

* where respondents are from
* how many confidants get reported
* what time of day the surveys were conducted

Whatever you decide to look at, be sure to come up with a quantitative way to understand the respondents better and also be sure to make at least one plot.

## 3. Personal networks

OK, so now we have a sense of what kind of respondents we have in our sample. Now we can turn to the actual personal network data.

**Question** Using `group`, make a table of the number of confidants our respondents reported.

In [None]:
survey.group('number_alters')

**Question** Try to make a histogram of this variable. (You may not be successful.)

In [None]:
survey.select('number_alters').hist()

A histogram makes sense for numerical data, but this variable is not currently numerical because of the '6+' category. In order to analyze this variable, we are going to have to decide what to do about the respondents who reported having more than 5 confidants. The function below was written to do this in one particular way.
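
To see concretely why the '6+' category gets in the way, the sketch below tries to convert some made-up responses to integers. The helper name `is_numeric` is hypothetical and used only for this illustration.

```python
# Made-up responses, stored as text the way a survey export might store them
responses = ["2", "5", "6+", "3"]

def is_numeric(value):
    """Return True if the text can be converted to an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False

# Collect the responses that cannot be treated as numbers as-is
not_numeric = [r for r in responses if not is_numeric(r)]
print(not_numeric)
```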

**Question** Look at the function below. What does it do? How does it handle respondents who reported more than 5 confidants?

In [None]:
def recode_number_alters(na):
    if na in ['1', '2', '3', '4', '5']:
        return int(na)
    elif na in ['6', '6+']:
        return 6

**Question** Using `recode_number_alters`, see if you can create a new column called `num_confidants` which has the recoded values.

In [None]:
survey['num_confidants'] = survey.apply(recode_number_alters, 'number_alters')
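
Conceptually, `apply` calls a function once for every value in a column and collects the results. The sketch below mimics that idea in plain Python. The `recode_yes_no` function and the responses are made up for illustration and are not part of our survey.

```python
# Hypothetical recoding function: turn a text answer into a number
def recode_yes_no(answer):
    if answer == "Yes":
        return 1
    elif answer == "No":
        return 0

# Made-up column of text answers
answers = ["Yes", "No", "Yes"]

# Call the function once per value, keeping the results in order
recoded = [recode_yes_no(a) for a in answers]
print(recoded)
```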

**Question** Double-check that the function worked the way you thought it
would by creating a table of the `num_confidants` column.

In [None]:
survey.group('num_confidants')

**Question** Now create a histogram of the `num_confidants` variable.

In [None]:
# Bin edges 1 through 7, so that each value from 1 to 6 gets its own bar
survey.hist('num_confidants', bins=np.arange(1, 8, 1))

**Question** Do you find anything surprising about these survey respondents' personal networks? Are they bigger or smaller than the results we saw from the General Social Survey? 

<div class='response'>
[Answer here]
</div>

## 4. Alters

Now we'll start to look more closely at the people that respondents reported having in their personal networks. We don't quite know enough Python to look at all of the alters. So we will focus on the relationship between the respondents and the first alter named. Fortunately, everyone in the survey reported at least one alter.

**Question** What do you expect to see when we compare the ages of the respondents and the ages of the first alters? (Any reasonable prediction is helpful here.)

<div class='response'>
[Answer here]
</div>

**Question** Make a scatter plot that compares the age of respondents (x axis) and the age of the first alter (y axis).

In [None]:
survey.scatter('respondent_age', 'alter1_age')

In [None]:
?survey.scatter

**Question** How would you describe any patterns in the scatter plot? Can you come up with a hypothesis for what might explain them?

<div class='response'>
[Answer here]
</div>

**Question** If you conducted another survey, what additional information could you collect to see if your hypothesis is right?

<div class='response'>
[Answer here]
</div>

**Question** Make a scatter plot that compares the age of respondents (x axis) and the age of the first alter (y axis) color coded by the gender of the first alter.

In [2]:
# Add Code

**Question** Do you see any interesting patterns in the last scatter plot?

<div class='response'>
[Answer here]
</div>

## 5. Submit the lab

You're almost done! Now please create a PDF version of your completed lab by going to the Jupyter 'File' menu, choosing 'Download as' and then 'PDF via LaTeX (.pdf)'. Please save the resulting .pdf on your computer and then submit it on bCourses.

**The lab must be submitted by the end of the day on Monday, Sep. 11.**