# Patterns of Association: Quality of English Spoken by People Who Speak Spanish in Their Homes

---
This notebook focuses on creating scatter plots from a data table and learning how to interpret them by determining associations between two quantitative variables. 

---

### Topics Covered
- What is the Census?
- English fluency among Spanish-speaking households
- Scatter plots and associations
- Regional unemployment

### Table of Contents

[Overview](#overwiew)<br>


1 - [Why the Census](#1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [How is the data used?](#s1.1)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Census and Congressional Representation](#s1.2)

2 - [Vocabulary](#2)<br>

3 - [Part 1](#3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.1 - [Apply What You've Learned](#3.1)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3.2 - [Determine and Explain Negative and Positive Associations](#3.2)

4 - [Bibliography](#4)<br>

---
## Overview <a id='overwiew'></a>

In Part 1 of this lesson, you will review scatterplots by investigating how trends in English fluency among Spanish-speaking households in the U.S. have changed over time. In Part 2, you will analyze scatterplots and linear relationships based on state and regional unemployment data.

---


In [9]:
#Press 'Shift' + 'Enter' to run this code! It will help load graphs & charts that this lesson has!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image

import sys
sys.path.append("../")
from censusnotebooks import widgets, visualization


**Tip**: When you're done reading each part of this lesson, press the **'Shift' + 'Enter' keys at the same time** to quickly move through each 'cell' and run the code! 
<!--#Press 'Shift' + 'Enter' to run this code! It will help load graphs & charts that this lesson has! -->
---

# 1. Why the Census? <a id='1'></a>

<div class="alert alert-info">
<b> Question:</b>
<!--But before we dive in, let's answer a question you might have:--> Why are we learning about statistics through Census data? In fact, why do we even care about the Census in the first place?
</div>

Depending on how old you are, you might or might not have watched your family or guardian fill out a long form about a decade ago that had some specific, slightly invasive questions like: 
1. How many people live or stay in this house, apartment, or mobile home?
2. What is the name of the person who owns this house, apartment, or mobile home?
3. How old is the person who owns this house, apartment, or mobile home? When is his or her birthday?

## 1.1 How is the Data Used?<a id='s1.1'></a>

The census form is a way for the government to get a good idea of **who** it is serving, which is: **YOU!** The government uses the data it gets from this nationwide Census to determine how funding is distributed across U.S. communities, and to understand where community services are needed and how to implement them. Because of the Census, your neighborhood and city can improve education and transportation, promote public health, and put the money they receive toward improvements. More services can be given to the elderly, new roads and schools can be built, and more job training centers can be established and maintained. *Everyone* benefits.

## 1.2 Census and Congressional Representation <a id='s1.2'></a>

The amount of representation that your state has in Congress (specifically, in the U.S. House of Representatives) can change every ten years, when the Census is conducted. The number of voices vouching for your state's well-being in Congress could change or stay the same. Additionally, the representative (or "voice") representing your community in local, state, and federal governments could change, because district lines are redrawn after each Census. Members of Congress, state legislators, and many county and municipal officials are elected by voters grouped into districts, and you could be voting for different people depending on how these lines are drawn. These representatives are your spokespeople: their job is to listen to your needs and support government policies that will help your community!


For more information, check out this document [here](https://www2.census.gov/programs-surveys/sis/resources/census101.pdf), designed to inform students like you about why the Census is so important! And for more reading on the confusing topic of redistricting, read [here](https://www.brennancenter.org/analysis/7-things-know-about-redistricting).

---

# 2. Vocabulary <a id='2'></a>

Here are some terms to go over before creating your scatter plot and analyzing English fluency in Spanish-speaking households and how fluency has changed over time. You'll need to understand these before moving on.

* **Bivariate data** – pairs of linked numerical observations. 
(Example: a list of heights and weights for each
player on a football team)
* **Categorical variable** – a variable that is not numerical, such as a name or label, that places an object into
one of several groups or categories (e.g., the color of a ball or the breed of a dog)
* **Quantitative variable** – a variable that is numerical, meaning that it represents a measurable quantity
(e.g., the population size of a city)
* **Explanatory variable** – the “independent” variable, which helps explain the changes in the response
variable
* **Response variable** – the “dependent” variable, which shows an outcome
* **Form of association** – the shape that data make moving from left to right on a scatter plot (e.g., linear
or curved, such as a parabolic, or quadratic, shape)
* **Line of best fit** – a straight line drawn through the center of a group of data points on a scatter plot,
showing how closely the two variables on the scatter plot are associated
* **Strength of association** – a measure of how tightly points are clustered
* **Negative direction of association** – an association in which one variable increases as the other decreases;
when a line of best fit represents a negative association, the line has a negative slope (e.g., for time a
musician spends practicing a piece and mistakes made during performing; the more time spent practicing,
the fewer mistakes that musician will likely make)
* **Outlier** – a data point that is well outside of the expected range of values or does not follow the overall
pattern of the other data points
* **Positive direction of association** – an association in which one variable increases as the other increases;
when a line of best fit represents a positive association, the line has a positive slope (e.g., for children’s age
and height; as children get older, they usually get taller)
* **Scatter plot** – a graph in the coordinate plane that displays a set of bivariate data and can be used to
determine how two variables are associated (e.g., to show associations between the heights and weights
of a group of people)
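
To make **strength of association** more concrete, here is a minimal sketch using one common numerical measure, the correlation coefficient (often written *r*), computed with NumPy's `np.corrcoef`. This lesson does not name a specific measure, so treating *r* as the "strength" is an assumption, and the heights and weights below are made-up illustrative values, not real data.

```python
import numpy as np

# Hypothetical heights (inches) and weights (pounds) for six players.
# These numbers are invented for illustration only.
heights = np.array([68, 70, 71, 73, 74, 76])
weights = np.array([165, 180, 178, 200, 205, 220])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r for the pair.
r = np.corrcoef(heights, weights)[0, 1]

# r always falls between -1 and 1. Values near 1 or -1 mean the points
# cluster tightly around a line; values near 0 mean a weak linear association.
print(f"Correlation coefficient r = {r:.2f}")
```

The sign of *r* matches the direction of association, so a single number hints at both direction and strength. It only describes *linear* patterns, though, so always look at the scatter plot as well.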


# 3. Part 1 <a id='3'></a>

## Scatterplot Review

Let's look at some data from the American Community Survey. The American Community Survey (ACS) is conducted monthly by the U.S. Census Bureau and is designed to show how communities change. By asking questions of a sample of the population, it produces national data on more than 35 categories of information, such as education, income, housing, and employment.

The following data is from the 2009-2013 American Community Survey 5-Year Estimates: 'Language Spoken At Home.' We will examine the trends in the quality of English spoken by U.S. residents who primarily speak Spanish at home by looking at **two** variables: the percentage of Spanish-speaking residents who reported speaking English **“very well”** and the percentage of Spanish-speaking residents who reported speaking English **“less than very well.”**



You can see the entire data table [here](https://factfinder.census.gov/bkmk/table/1.0/en/ACS/13_5YR/B16001/0100000US), but we've done the work to extract these two variables for you.

Run this code to see the data table by pressing **'Shift' + 'Enter'**

In [10]:
data = {'Percentage who reported speaking English “less than very well”':[45.7, 44.7, 43.7, 42.1, 42.2],
        'Percentage who reported speaking English “very well”':[54.3, 55.3, 56.3, 57.9, 57.8], 
        'Year': [2009, 2010, 2011, 2012, 2013]} 

chart1 = pd.DataFrame(data)
chart1 = chart1[['Year', 'Percentage who reported speaking English “very well”', 
                 'Percentage who reported speaking English “less than very well”']]
chart1

| | Year | Percentage who reported speaking English “very well” | Percentage who reported speaking English “less than very well” |
|---|---|---|---|
| 0 | 2009 | 54.3 | 45.7 |
| 1 | 2010 | 55.3 | 44.7 |
| 2 | 2011 | 56.3 | 43.7 |
| 3 | 2012 | 57.9 | 42.1 |
| 4 | 2013 | 57.8 | 42.2 |


We're now going to create a **scatter plot** to better visualize the data.

**However**, before you scroll down, take a moment to think about what this graph might look like:

* Which variables are on the x- and y-axes?

* Can you already predict what the points might look like on the graph? What trends do you see?

**Tips to graph**: The graph you make should have *ten dots* total, with *two* colors: one to represent the percentages of people who reported speaking English "less than very well," and one to represent the percentages of those who reported speaking English "very well."

The graph should look like this:

In [11]:
#Run this code to see a graph by pressing 'Shift' + 'Enter'
#Image(filename='./quality_of_english.png', width = 600, height = 600) 

Take a few moments to understand the graph above. 

Remind yourself...
* Why are we using _different_ colors in the legend?
* What _patterns_ do you see in the data?

And ask...

* Do you think these two paths ever crossed in earlier years?
* Do you think these paths will cross in future years?

In the cell below, work through the plot. Does your own graph look like this one? If there are any differences, try to understand why. If something looks off, it may be because there aren't any *labels* distinguishing between the two different trends: this scatterplot needs a **legend** and labeled **axes**. Talk to your neighbors as well!

In [12]:
# Run to see the scatter plot by pressing 'Shift' + 'Enter'

# from censusnotebooks.visualization import matplotlib_seaborn

# plotter = matplotlib_seaborn.Plot(chart1)
# plotter.new_plot()
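
If the `censusnotebooks` plotter is not available, the same scatter plot can be sketched directly with matplotlib. This is not the lesson's own plotting tool; the colors, legend labels, axis labels, and title below are assumptions chosen for illustration, and the numbers are copied from the data table above.

```python
import matplotlib.pyplot as plt

# Values copied from the lesson's data table.
years = [2009, 2010, 2011, 2012, 2013]
very_well = [54.3, 55.3, 56.3, 57.9, 57.8]
less_than_very_well = [45.7, 44.7, 43.7, 42.1, 42.2]

fig, ax = plt.subplots()

# One scatter call per category, so each gets its own color and legend entry.
ax.scatter(years, very_well, color="tab:blue",
           label='Spoke English "very well"')
ax.scatter(years, less_than_very_well, color="tab:orange",
           label='Spoke English "less than very well"')

# Label the axes and show only whole years on the x-axis.
ax.set_xticks(years)
ax.set_xlabel("Year")
ax.set_ylabel("Percentage of Spanish-speaking residents")
ax.set_title("Quality of English Spoken by People Who Speak Spanish at Home")
ax.legend()
plt.show()
```

Each call to `ax.scatter` draws five dots, which gives the ten dots in two colors described in the graphing tips.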



#### Now that you have your full graph, with axis labels and a title included, it's time to answer a few questions:


<div class="alert alert-info">  
    
1. What is the **explanatory** variable on your scatter plot? 
2. What is the **response** variable for each set of data points?
    
</div>


_Tip_: If you've forgotten what these variables mean, look at the vocabulary section above again!

**Your answer:** Type your answer here for question 1


**Your answer:** Type your answer here for question 2


If data points on a scatter plot form a *positive* slope, it means that they have a *positive association*. In
these cases, as you look from left to right on the scatter plot, the line appears to move “uphill.” This means
that **as the explanatory variable increases, the response variable tends to increase as well.**

<div class="alert alert-info">  
    
3. Which _category of data_ shows a **positive** association over time?
4. Could you explain why that association may be positive? 
    
</div>

**Your answer:** Type your answer here for question 3


**Your answer:** Type your answer here for question 4


On the other hand, if data points on a scatter plot form a _negative_ slope, it means they have a _negative association_. In these
cases, as you look from left to right on the scatter plot, the line appears to move “downhill.” This means
that **as the explanatory variable increases, the response variable tends to decrease**.
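
Both directions can also be checked numerically by looking at the sign of the slope of a line of best fit. The sketch below uses NumPy's `np.polyfit` to fit a straight line; the helper name `direction_of_association` and the sample numbers are hypothetical, chosen to echo the vocabulary examples, and calling a slope of almost exactly zero "none" is a simplification.

```python
import numpy as np

def direction_of_association(x, y):
    """Return "positive", "negative", or "none" from the best-fit slope."""
    # A degree-1 fit returns the slope first, then the intercept.
    slope, intercept = np.polyfit(x, y, 1)
    if np.isclose(slope, 0):
        return "none"
    return "positive" if slope > 0 else "negative"

# Made-up illustrative values, not real measurements.
ages = [2, 4, 6, 8, 10]            # children's ages in years
heights = [34, 40, 45, 50, 55]     # heights in inches
hours_practiced = [1, 2, 3, 4, 5]  # hours of practice
mistakes = [9, 7, 6, 3, 2]         # mistakes made while performing

print(direction_of_association(ages, heights))              # positive
print(direction_of_association(hours_practiced, mistakes))  # negative
```

You can pass the years and either percentage column from the data table to this helper to check your answers to the questions in this section.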

<div class="alert alert-info" >  
    
5. Which _category of data_ shows a **negative** association over time?
6. Could you explain why that association may be negative?
</div>


**Your answer:** Type your answer here for question 5


**Your answer:** Type your answer here for question 6



<div class="alert alert-info">  
    
7. Now, summarize the association between the percentage of Spanish-speaking residents who reported speaking English “very well” and the years 2009 through 2013:

</div>

**Your summary:** Type your answer here


<div class="alert alert-info">  
    
8. Is there a relationship between the percentage of Spanish speakers who reported speaking English “very well” and the percentage of Spanish speakers who reported speaking English “less than very well”? Hint: Have you tried **adding** the two percentages for every year?

</div>

**Your answer:** Type your answer here
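
If you would like to follow the hint with code rather than by hand, here is a minimal sketch that adds the two percentages for every year using pandas. The short column names are assumptions made for readability; the numbers are copied from the data table earlier in this part.

```python
import pandas as pd

# Values copied from the lesson's data table, with shortened column names.
chart = pd.DataFrame({
    "Year": [2009, 2010, 2011, 2012, 2013],
    "very_well": [54.3, 55.3, 56.3, 57.9, 57.8],
    "less_than_very_well": [45.7, 44.7, 43.7, 42.1, 42.2],
})

# Add the two percentages for each year. Rounding to one decimal place
# hides tiny floating-point errors that can appear when adding decimals.
chart["total"] = (chart["very_well"] + chart["less_than_very_well"]).round(1)

print(chart[["Year", "total"]])
```

Look at the `total` column: what does it tell you about how the two variables depend on each other?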

------------------------------------------
**Congratulations!** Hopefully, you now not only know how to create scatter plots and describe direction, form, and strength of associations within these scatter plots, but also understand the relationship between two variables measured by the Census and their importance in the real world.




## 3.1 Apply What You've Learned <a id='3.1'></a>

In this part of the lesson, you will practice what you've learned so far about patterns of association, using the guesstimation game below. Try to guess the direction, form, and strength of each association. You will see several examples, each with a given context.
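
Judging **form** by eye takes practice. As a rough numerical check, you can compare how far the points sit from a best-fit straight line versus a best-fit curve. The sketch below is one possible approach, not part of the original lesson: the helper `fit_error` is hypothetical, it uses NumPy's `np.polyfit` and `np.polyval`, and the points are made up so that they rise and then fall.

```python
import numpy as np

def fit_error(x, y, degree):
    """Sum of squared gaps between the points and a best-fit polynomial."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coefficients = np.polyfit(x, y, degree)  # degree 1 = line, 2 = parabola
    predicted = np.polyval(coefficients, x)
    return float(np.sum((y - predicted) ** 2))

# Made-up illustrative points that rise and then fall.
x = [0, 1, 2, 3, 4, 5, 6]
y = [0, 5, 8, 9, 8, 5, 0]

line_error = fit_error(x, y, 1)   # error for a straight line
curve_error = fit_error(x, y, 2)  # error for a parabola

print(f"Straight-line error: {line_error:.1f}")
print(f"Parabola error: {curve_error:.1f}")
```

A curve always fits at least as well as a line, so a small improvement means little. A very large drop in error, as in this example, suggests that the form is curved rather than linear.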

<div class="alert alert-warning">
<b>Scenario 1:</b> A company that sells video games wants to find the best times to sell its games in order to increase sales and profit. To do this, a worker at the company decides to study sales patterns for a certain video game. Use the graph below to look at the association, with x as the number of weeks since the video game was released and y as the number of video games sold by one internet vendor.
</div>



In [None]:
# Run this cell to see the graph.
video_game_sales = {"Time Since Video Game Release (weeks)":[1, 1.25, 1.375, 1.5, 2.5, 3.5, 5, 6.5, 7, 7.5, 8, 8.5, 9],
                    "Number of Video Games Sold":[2.5, 4.25, 5.4, 6, 7.25, 7.75, 8, 7.8, 7.25, 8.5, 7, 5, 3.5]} 
tbl = pd.DataFrame(video_game_sales)
plt.scatter(tbl["Time Since Video Game Release (weeks)"].values, tbl["Number of Video Games Sold"].values)
plt.xlabel("Time Since Video Game Release (weeks)")
plt.ylabel("Number of Video Games Sold")
plt.show()

**Explanation:** Describe the relationship you see in the above graph here

In [None]:
# Type your guesses here! Remember to press 'Shift' + 'Enter' to move on once you're finished!
# Dropdown menus modeled off of below options:

direction = ""    # fill in the "" - is the direction "positive", "negative", or "none"?
form = ""         # fill in the "" - is the form "linear", "curved", or "none"?
strength = ""     # fill in the "" - is the strength "very strong", "somewhat strong", or "weak"?
explanation = ""  # fill in the "" with one sentence describing the relationship you see in the above graph!

---
<div class="alert alert-warning">
<b>Scenario 2:</b> A music student wants to perform in an upcoming piano recital, but she's not sure how much practice she needs in order to improve enough in time. To help this student, as well as future students, her music instructor decides to observe and record the practice hours and improvement of some of her other students. Use the graph below to analyze the association, with x as the number of hours spent practicing recital music and y as the number of mistakes made, as recorded by the music instructor.
</div>

In [None]:
# Run this cell to see the graph.
practice_data = {"Time Spent Practicing Recital Music (hrs)":[0.25, 0.75, 1.5, 2.5, 2.8, 3, 3.75, 4.6, 5.25, 5.35, 7.15, 8, 9.25],
                 "Number of Mistakes Made":[8.5, 7.25, 1.5, 7.5, 5.8, 7, 4.5, 7, 5, 3.5, 3.7, 2.25, 0.75]} 
tbl = pd.DataFrame(practice_data)
plt.scatter(tbl["Time Spent Practicing Recital Music (hrs)"].values, tbl["Number of Mistakes Made"].values)
plt.xlabel("Time Spent Practicing Recital Music (hrs)")
plt.ylabel("Number of Mistakes Made")
plt.show()

**Explanation:** Describe the relationship you see in the above graph here

In [None]:
# Type your guesses here! Remember to press 'Shift' + 'Enter' to move on once you're finished!
# Dropdown menus modeled off of below options:

direction = ""    # fill in the "" - is the direction "positive", "negative", or "none"?
form = ""         # fill in the "" - is the form "linear", "curved", or "none"?
strength = ""     # fill in the "" - is the strength "very strong", "somewhat strong", or "weak"?
explanation = ""  # fill in the "" with one sentence describing the relationship you see in the above graph!

---
<div class="alert alert-warning">
<b>Scenario 3:</b> A middle school student and his friends are very interested in baseball, and they all collect baseball cards. One day, the student wonders whether the sizes of their individual collections create distractions that cause some of them to take longer to get to school than others, making them arrive late. Use the graph below to understand the association, with x as the number of baseball cards owned by a student and y as the number of minutes it takes a student to walk to school.
</div>

---
## 3.2 Determine and Explain Negative and Positive Associations <a id='3.2'></a>

Now that you have learned about scatter plots and about both positive and negative associations, it's time to put that knowledge to the test!

We are going to go through a few scenarios where **you** will determine whether the association is positive or negative. Similar to before, you will be given the x-axis and y-axis for the scatter plot, but this time without the plot visualization.

*Note: If you forget the definitions of positive and negative associations, review Parts 1 and 2!*


<div class="alert alert-warning">
<b>Scenario 1:</b> We have a scatter plot where the <b>x-axis</b> is the number of days a student misses school and the <b>y-axis</b> is the number of hours spent on makeup work. 
</div>

In the cell below, type whether you think **Scenario 1** shows a **positive or negative association**. Then briefly explain the reasoning behind your answer.

**Explanation:** Type your answer here

In [None]:
# Run this code to see a graph by pressing 'Shift' + 'Enter' to see if you answered correctly. 

data = {"Missed School Days":[0,1,1, 4, 3, 10, 2, 2, 8],
        "Hours Spent on Makeup Work":[0,4,4, 22, 16, 50, 8,7, 42]} 
table = pd.DataFrame(data)
plt.scatter(table["Missed School Days"].values, table["Hours Spent on Makeup Work"].values)
plt.xlabel("Missed School Days")
plt.ylabel("Hours Spent on Makeup Work")
plt.show()

--------------------------------------

<div class="alert alert-warning">
<b>Scenario 2:</b> We have a scatter plot where the <b>x-axis</b> is the ounces of insecticide used and the <b>y-axis</b> is the mosquito population (hundreds per acre). 
</div> 


In the cell below, type whether you think **Scenario 2** shows a **positive or negative association**. Then briefly explain the reasoning behind your answer.

**Explanation:** Type your answer here

In [None]:
# Run this code to see a graph by pressing 'Shift' + 'Enter' to see if you answered correctly. 
import random
insecticide = sorted(random.sample(range(1, 500), 10), reverse = True)
population = sorted(random.sample(range(1,100), 10))

data = {"Ounces of Insecticide":insecticide,
        "Mosquito Population":population} 
table = pd.DataFrame(data)
plt.scatter(table["Ounces of Insecticide"].values, table["Mosquito Population"].values)
plt.xlabel("Ounces of Insecticide")
plt.ylabel("Mosquito Population")
plt.show()
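
The cell above draws new random numbers each time it runs, so your graph will not match a neighbor's. If you want the same graph on every run, one option is to fix the random seed. This is a minimal sketch: the function name `make_scenario_data` and the seed value 42 are assumptions, and it rebuilds the same kind of lists as the cell above using Python's `random.Random`.

```python
import random

def make_scenario_data(seed):
    """Build the insecticide and mosquito lists from a fixed random seed."""
    rng = random.Random(seed)  # a private generator with a known seed
    # Insecticide amounts sorted from largest to smallest.
    insecticide = sorted(rng.sample(range(1, 500), 10), reverse=True)
    # Mosquito populations sorted from smallest to largest.
    population = sorted(rng.sample(range(1, 100), 10))
    return insecticide, population

first = make_scenario_data(seed=42)
second = make_scenario_data(seed=42)

# The same seed gives the same lists, so the scatter plot is repeatable.
print(first == second)  # True
```

Because one list is sorted in decreasing order and the other in increasing order, every pair of points follows the same downhill pattern no matter which seed you choose.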

--------------------------------------
<div class="alert alert-warning">
    <b>Scenario 3:</b> We have a scatter plot where the <b>x-axis</b> is the number of hours spent studying the spelling of vocabulary words and the <b>y-axis</b> is the number of vocabulary words spelled incorrectly. 
</div>


In the cell below, type whether you think **Scenario 3** shows a **positive or negative association**. Then briefly explain the reasoning behind your answer.

In [None]:
widgets.Text();

**Explanation:** Type your answer here

In [None]:
# Run this code to see a graph by pressing 'Shift' + 'Enter' to see if you answered correctly. 

import random  # imported here as well so this cell can run on its own

hours = sorted(random.sample(range(1, 50), 10), reverse = True)
spelled = sorted(random.sample(range(1,20), 10))

data = {"Hours Spent Studying":hours,
        "Number of Words Spelled Incorrectly":spelled} 
table = pd.DataFrame(data)
plt.scatter(table["Hours Spent Studying"].values, table["Number of Words Spelled Incorrectly"].values)
plt.xlabel("Hours Spent Studying")
plt.ylabel("Number of Words Spelled Incorrectly")
plt.show()

--------------------------
**Congratulations!** You just finished Part 1 of this lesson! Make sure to click 'File', then 'Save and Checkpoint' in the upper left-hand corner to save all the hard work you've done!

### Lastly, before you exit out of this interactive lesson...

...it's always a good idea to connect what you're learning statistically/mathematically with visuals and narratives from the actual communities who were surveyed. ***Numbers are important, but never tell the entire story.***

Let's go back to what we analyzed in **Part 1** of this lesson. Chances are, if you yourself are a child of immigrant parents/grandparents or weren't born in the U.S., the "variable," i.e. the level of English fluency, is relatable to your own life experiences. The Census Bureau has been asking questions about languages spoken at home since 1890, and collected language data in the 1980, 1990, and 2000 decennial censuses using a series of three questions asked of the population 5 years old and over. 

Below is a choropleth map of limited-English-speaking households as a percentage of the total county population in counties across the U.S. A *choropleth* map is essentially a map that represents a quantity or percentage of some variable through the *shading* of a color (the darker the color, the 'higher' the percentage or quantity of the thing measured).

<a href="https://www.census.gov/library/visualizations/2017/comm/english-speaking.html?cid=english-speaking" target="_blank"><img src="https://www.census.gov/content/census/en/library/visualizations/2017/comm/english-speaking/jcr:content/map.detailitem.950.high.jpg/1512595122203.jpg" alt="Limited English Speaking Households as a Percentage of County Total" width="648" height="648" title="Limited English Speaking Households as a Percentage of County Total"/></a>

Visuals also allow us to better understand and appreciate the sheer linguistic and cultural diversity that exists in the U.S. *The visual below is taken from a 2017 article on share.america.gov, a site managed by the U.S. Department of State.*

<img src="https://staticshare.america.gov/uploads/2017/08/Lang_maps_Reorder-01-768x1867.jpg" alt="Most commonly spoken languages other than English in the U.S." width = "500" height = "900" title="Most commonly spoken languages other than English in the U.S."/>

Curious about how commonly your native language(s) are spoken within your *own* community? Find out with this handy *'Language Mapper' Tool*, based on 2011 data, [here](https://www.census.gov/hhes/socdemo/language/data/language_map.html).

It's clear that a diverse range of languages and cultures makes up the fabric of the U.S.

And yet, even in 2019, **citizens are discriminated against and put in danger for even conversing in their native languages.** Testimonies reported by several news outlets, from [The Guardian](https://www.theguardian.com/us-news/2018/may/22/speaking-spanish-dangerous-america-aaron-schlossberg-ice) (*warning: potentially vulgar language*) to [El Pais](https://elpais.com/elpais/2018/05/30/inenglish/1527671538_960209.html) and [Remezcla](https://remezcla.com/features/culture/why-we-wont-stop-speaking-spanish-in-public/), show the grim realities that Spanish speakers/hispanohablantes face. 

Spanish speakers are reprimanded by non-Spanish-speaking white co-workers/older executives for socializing in Spanish 'because “other people” might think they were talking trash about them,' told that '“the least they can do is speak English”,' subjected to the stereotypical assumption that they are unable to 'speak English well enough,' and even detained by Border Patrol in Montana and verbally attacked in public in New York.

Thus, while plotting the levels of English fluency among Spanish-speaking households across the U.S. and analyzing those graphs are important ways to learn basic statistics concepts, it is crucial to also understand that this data and these numbers are tied to real families who have to go through those experiences. **Actively thinking about *who* the data is describing is just as important as understanding what conclusions can be derived from the data or *what* the data is measuring.**

For Spanish speakers and other individuals who speak a language other than English, the data used in this lesson comes from familiar, close-to-home narratives. If you identify as a member of a non-English-speaking household, **being proud and taking ownership of your language(s), culture(s), and heritage** is the ***ultimate*** form of resistance to combat this horribly racist trend. 

---

## 4. Bibliography <a id='4'></a>


**Notebook developed by:** Varsha Vaidyanath, Jarelly Martin, and Jennifer Kwon

**Modified by:** Varsha Vaidyanath, Jarelly Martin, Yuyang Zhong, and Sandeep Sainath

**Date:** August 4, 2019 

**Using:** https://www.census.gov/programs-surveys/sis/activities/math/patterns.html
    
**Suggested Grade Level:** 9-12

**Data Science Discovery Program:** http://data.berkeley.edu/education/module
