# INFOSCI 2950: Final Project Phase II Submission

*Madelyn Leon, Lincy Chen, and Jessica Kuang*

---
## Research Questions

**Main Goal: How can we define Asian-American communities in Texas in terms of health, happiness, and financial security? How do health, finances, community support (or the lack thereof), and identity affect a Texan Asian-American's quality of life?**

*Financial Security*
1. Do some ethnic groups of Asian-Americans (AAs) in Texas earn more than others?
2. Do younger Asian-Americans earn more money than older generations of Asian Americans living in Texas?
3. How does household size relate to duration of residency for AAs in Texas?
4. How do household size and income relate to the quality of life of a participant?
5. How does the amount of education completed relate to income? Does this depend on whether the participant was US-born?

*Happiness*
1. Are some ethnic groups more likely to be born in the United States than others?
2. What are the most dominant religions among ethnic groups in Asian-American communities in Texas?
3. How is English speaking ability related to English difficulty among AA communities in Texas?
4. Are certain religious groups within AAs more likely to experience discrimination than other religious groups within AAs in Texas?
5. Are certain ethnic groups within AAs more likely to experience discrimination than other ethnic groups within AAs in Texas?

*Health*
1. Is smoking among Asian-American populations linked to heart disease?
2. Are older generations of AAs more likely to follow a healthy diet than younger generations of AAs in Texas?
3. How do the variables healthy diet and regular exercise correlate with each other?
4. Does the presence of health insurance and check-ups increase someone's quality of life?

*Community Support*
1. Can we predict someone's quality of life by looking at the number of close friends they have?
2. Is religious affiliation a good predictor of quality of life?
3. What quality of life do retired people tend to have?

---

## Data Collection and Cleaning
### Data Collection
1. Go to data [landing page](https://data.austintexas.gov/City-Government/Final-Report-of-the-Asian-American-Quality-of-Life/hc5t-p62z). 
2. Click on Export > CSV.
3. Download publicly available `Final_Report_of_the_Asian_American_Quality_of_Life__AAQoL_.csv` into desired directory.

### Data Cleaning
1. Store raw data into a preliminary dataframe, `df`
2. Convert column names into snake_case

In [2]:
## load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
## Step 1
df = pd.read_csv('Final_Report_of_the_Asian_American_Quality_of_Life__AAQoL_.csv')

In [5]:
## Step 2
new_colnames = [i.lower() for i in df.columns]
new_colnames = [i.replace(" ","_") for i in new_colnames]

#### Column Names
3. Save these new column names to a new dataframe, `asian`
4. Select columns from the existing `asian` dataframe for data analysis
- [ ] Income
- [ ] Retired
- [ ] US Born
- [ ] English Speaking
- [ ] English Difficulties
- [ ] Ethnicity
- [ ] Age
- [ ] Regular Exercise
- [ ] Healthy Diet
- [ ] Heart Disease
- [ ] Drinking
- [ ] Smoking
- [ ] Cancer
- [ ] Health Insurance
- [ ] Physical Check-up
- [ ] Quality of Life
- [ ] Religion
- [ ] Gender
- [ ] Close Friends
- [ ] Discrimination
- [ ] Duration of Residency
- [ ] Household Size
- [ ] Education Completed
- [ ] Cleaning Entries

In [6]:
## Step 3
asian = df.copy()
asian.columns = new_colnames

In [7]:
## Step 4
asian = asian[['income', 'retired', 'us_born', 'english_speaking', 'english_difficulties', 'ethnicity','age', 'regular_exercise', 'healthy_diet', 'heart_disease', 'drinking', 'smoking',
               'cancer', 'health_insurance', 'physical_check-up', 'quality_of_life', 'religion', 'gender', 'close_friends', 'discrimination_', 'duration_of_residency', 'household_size',
              'education_completed']]
# Additional improvements to asian
asian = asian.rename(columns = {'discrimination_':'discrimination', 'physical_check-up':'physical_checkup'})
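
Steps 2 to 4 leave two awkward names behind (`discrimination_` and `physical_check-up`) that then need a manual rename. A minimal sketch of a helper that normalizes headers in one pass, assuming the raw headers differ from snake_case only by case, spaces, hyphens, and stray trailing spaces. The function name `to_snake_case` and the sample headers are hypothetical and chosen for illustration:

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a header and collapse runs of non-alphanumerics into one underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()


# Hypothetical raw headers for illustration only
raw = pd.DataFrame(columns=["US Born", "Discrimination ", "Physical Check-up"])
raw.columns = [to_snake_case(c) for c in raw.columns]
print(list(raw.columns))
```

Note that this approach yields `physical_check_up` rather than the `physical_checkup` used in the rest of the notebook, so either that one rename is still needed or the downstream column name would have to change.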

#### Cleaning Entries
5. For the corresponding columns, convert null data according to the table below:

| Column                | Modification to NaNs           |
|-----------------------|--------------------------------|
| duration_of_residency | -1                             |
| education_completed   | -1                             |
| discrimination        | 0                              |
| household_size        | 0                              |
| english_speaking      | 0                              |
| english_difficulties  | 0                              |
| retired               | 0                              |
| us_born               | 0                              |
| health_insurance      | 0                              |
| physical_checkup      | 0                              |
| regular_exercise      | 0                              |
| healthy_diet          | 0                              |
| heart_disease         | 0                              |
| income                | 0                              |
| quality_of_life       | median (5.0)                   |
| age                   | median (40.0)                  |
| close_friends         | median (3.0)                   |
| gender                | "Unknown"                      |
| ethnicity             | "Unknown"                      |
| religion              | "Unknown"                      |

6. Replace binary column data with numeric responses, with 1 indicating "Yes" and 0 indicating "No"
7. Convert column data into the intended data types

| Column                | Data Type                                                                          |
|-----------------------|------------------------------------------------------------------------------------|
| income                | int: {0 (for NaNs), 1 (0-9999), 2 (10000-19999), ..., 7 (60000-69999), 8 (70000+)} |
| retired               | int: {0 for No, 1 for Yes}                                                         |
| us_born               | int: {0 for No, 1 for Yes}                                                         |
| english_speaking      | int: {'Not at all': 1, 'Not well': 2, 'Well': 3, 'Very well': 4}                   |
| english_difficulties  | int: {'Not at all': 1, 'Not much': 2, 'Much': 3, 'Very much': 4}                   |
| ethnicity             | string                                                                             |
| age                   | int                                                                                |
| regular_exercise      | int: {0 for No, 1 for Yes}                                                         |
| healthy_diet          | int: {0 for No, 1 for Yes}                                                         |
| heart_disease         | int: {0 for No, 1 for Yes}                                                         |
| drinking              | int: {0 for No, 1 for Yes}                                                         |
| smoking               | int: {0 for No, 1 for Yes}                                                         |
| cancer                | int: {0 for No, 1 for Yes}                                                         |
| health_insurance      | int: {0 for No, 1 for Yes}                                                         |
| physical_checkup      | int: {0 for No, 1 for Yes}                                                         |
| quality_of_life       | float (numeric score; NaNs replaced with the median)                               |
| religion              | string                                                                             |
| gender                | string                                                                             |
| close_friends         | float                                                                              |
| discrimination        | int: {0 for No, 1 for Yes}                                                         |
| duration_of_residency | float                                                                              |
| household_size        | int                                                                                |
| education_completed   | int                                                                                |
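
For the ordinal columns in this table, one possible way to apply the label-to-number mapping is `Series.map`, sketched below on made-up responses. The names `speaking_levels`, `responses`, and `encoded` are hypothetical:

```python
import pandas as pd

speaking_levels = {"Not at all": 1, "Not well": 2, "Well": 3, "Very well": 4}

# Made-up responses; the last one simulates a missing answer
responses = pd.Series(["Well", "Very well", "Not at all", None])

# map() turns any label missing from the dictionary (including NaN) into NaN,
# so unexpected spellings surface as missing values instead of passing through
encoded = responses.map(speaking_levels).fillna(0).astype(int)
print(encoded.tolist())
```

The design difference from `replace` is that `replace` leaves unrecognized strings in the column untouched, which only shows up later when the type conversion fails.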

In [8]:
asian['quality_of_life']

0       NaN
1       NaN
2       8.0
3       NaN
4       NaN
       ... 
2604    8.0
2605    9.0
2606    6.0
2607    6.0
2608    8.0
Name: quality_of_life, Length: 2609, dtype: float64

In [None]:
## Step 5, 6, 7

#replacing NaNs with 0
asian['english_speaking'] = asian['english_speaking'].fillna(0)
    
#replacing NaNs with 0
asian['english_difficulties'] = asian['english_difficulties'].fillna(0)

#replacing NaNs with 0
asian['retired'] = asian['retired'].fillna(0)

#replacing NaNs with 0
asian['us_born'] = asian['us_born'].fillna(0)

#replacing NaNs with 0
asian['health_insurance'] = asian['health_insurance'].fillna(0)

#replacing NaNs with 0
asian['physical_checkup'] = asian['physical_checkup'].fillna(0)

#replacing NaNs with 0
asian['regular_exercise'] = asian['regular_exercise'].fillna(0)

#replacing NaNs with 0
asian['healthy_diet'] = asian['healthy_diet'].fillna(0)

#replacing NaNs with 0
asian['heart_disease'] = asian['heart_disease'].fillna(0)

#replacing NaNs with 5.0
asian['quality_of_life'] = asian['quality_of_life'].fillna(5.0)

#replacing NaNs with 0
asian['discrimination'] = asian['discrimination'].fillna(0)

#replacing NaNs with -1
asian['duration_of_residency'] = asian['duration_of_residency'].fillna(-1)

#replacing NaNs with 0 to flag missing responses; a real household_size cannot be 0 because participants count themselves
asian['household_size'] = asian['household_size'].fillna(0)

#replacing NaNs with -1
asian['education_completed'] = asian['education_completed'].fillna(-1)

#replacing NaNs with Unknown
asian['gender'] = asian['gender'].fillna('Unknown')

#replacing NaNs with Unknown
asian['ethnicity'] = asian['ethnicity'].fillna('Unknown')

#replacing NaNs with Unknown
asian['religion'] = asian['religion'].fillna('Unknown')

#replacing NaNs with median age
asian['age'] = asian['age'].fillna(40.0)

#replacing NaNs with median number of close friends
asian['close_friends'] = asian['close_friends'].fillna(3.0)

#replacing NaNs with 0
asian['income'] = asian['income'].fillna(0)

# changing english_speaking column to be represented by floats
asian['english_speaking'] = asian['english_speaking'].replace({'Not at all': 1, 'Not well': 2, 'Well': 3, 'Very well': 4})
asian['english_speaking'] = asian['english_speaking'].astype(float, errors = 'raise')

# changing english_difficulties column to be represented by floats
asian['english_difficulties'] = asian['english_difficulties'].replace({'Not at all': 1, 'Not much': 2, 'Much': 3, 'Very much': 4})
asian['english_difficulties'] = asian['english_difficulties'].astype(float, errors = 'raise')

# changing retired column to be represented by floats
asian["retired"] = asian["retired"].replace({"Retired": 1})
asian["retired"] = asian["retired"].astype(float, errors='raise')

# changing us_born column to be represented by floats
asian["us_born"] = asian["us_born"].replace({"No": 0, "Yes": 1})
asian["us_born"] = asian["us_born"].astype(float, errors='raise')

# changing health_insurance column (integer 1, not the string "1", to match the 0s filled in for NaNs)
asian['health_insurance'] = asian['health_insurance'].replace({"Yes": 1})

# changing physical_checkup column to be represented by floats
asian["physical_checkup"] = asian["physical_checkup"].replace({"Yes": 1})
asian['physical_checkup'] = asian['physical_checkup'].astype(float, errors = 'raise')

# changing income column entries to be represented by integers
asian['income'] = asian['income'].replace({'$0 - $9,999': 1, '$10,000 - $19,999': 2, '$20,000 - $29,999': 3, '$30,000 - $39,999': 4,
                                           '$40,000 - $49,999': 5, '$50,000 - $59,999': 6, '$60,000 - $69,999': 7, '$70,000 and over': 8})

In [None]:
asian.head()

---

## Data Description
The participants in this study, all located in Austin, responded to survey questions that were framed with respect for and appreciation of diverse cultures and that acknowledged the legacy of the Asian community. The data was collected with the primary objective of improving Asian American resources for the local Austin community. These resources cover health, housing, culture, civic engagement, and economic development. The data set was gathered over a three-year community engagement process in which commissioners, consultants, and city staff partnered with agencies and volunteers to meet with Asian Americans in the community. More than 3,350 individuals took one of the surveys, either online or in person, throughout the city. This took place at “Conversation Over Tea” events and other City of Austin locations. These locations, in addition to travel booths, were used to facilitate dialogue and gather anecdotes from participants' lives. The participants come from every ZIP code within Austin and its neighboring areas. In terms of ethical concerns, the participants were aware that their data was being collected and shared for the purpose of better understanding their experiences as Asian Americans. The raw data can be found [here](https://data.austintexas.gov/City-Government/Final-Report-of-the-Asian-American-Quality-of-Life/hc5t-p62z), under Export > CSV.

---

## Data Limitations

The limited number of quantitative variables will make it difficult to generate traditional-looking scatterplots and to visualize possible relationships between variables. Because of this, we foresee barriers in conducting assessments that make predictions based on existing trends. When addressing demographics, the lack of records from previous years will be an obstacle to comparing whether factors measuring the quality of life of Asian-Americans have progressed or regressed. One real-world impact of this limitation is that Asian-Americans will be less able to identify existing quality-of-life indicators and the practices that the city of Austin should continue or cease based on patterns in the data. This will restrict the specificity of our insights about emerging patterns. The data collected is also limited to representing the attitudes of Asian-Americans living in Texas. Their environment may differ from that of Asian-Americans in more or less urban areas, so their attitudes cannot encapsulate the attitudes of all Asian-Americans. The results are therefore less broadly applicable, as we do not have a random sample of all Asian-Americans. This dataset's quality of record-keeping also faltered in areas where respondents were able to leave answers blank. This resulted in NaNs, which affect the meaning of the results.

Certain ethnic groups dominate over others, which may lead to skewed results when extrapolating quality-of-life measurements to the attitudes of Asian-American populations as a whole. For example, out of all the religious groups described, Protestant Asian-Americans reported experiencing the highest amount of discrimination. However, it would be inaccurate to assume that Protestantism is a motivating factor for racists to engage in discriminatory behaviors towards Asian-Americans. Some response variables will be affected by confounding variables such as differing cultural aspects among sub-ethnic groups (e.g., some ethnic groups have leaner diets, which would impact conclusions drawn when studying Asian-American health). In this case, the higher rate of discrimination can also be explained by the fact that the Korean-Americans in the data are primarily of Protestant faith. Given that they were also the ethnic group reporting the highest amount of discrimination, this shows that the data can only go so far in explaining why certain qualities might cause others.
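
One way to inspect the kind of overlap between religion and ethnicity described above is a row-normalized cross-tabulation. The sketch below uses made-up rows rather than the survey data, and the names `toy` and `religion_by_ethnicity` are hypothetical:

```python
import pandas as pd

# Made-up rows for illustration; not the survey data
toy = pd.DataFrame({
    "ethnicity": ["Korean", "Korean", "Korean", "Indian", "Indian", "Chinese"],
    "religion": ["Protestant", "Protestant", "Catholic", "Hindu", "Hindu", "Buddhist"],
})

# Share of each religion within each ethnicity (every row sums to 1)
religion_by_ethnicity = pd.crosstab(
    toy["ethnicity"], toy["religion"], normalize="index"
)
print(religion_by_ethnicity.round(2))
```

If one religion accounts for most of an ethnicity's row, a difference observed between religions may largely reflect a difference between ethnicities, and the two cannot be separated by looking at either variable alone.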

---

## Exploratory Data Analysis
### Outline

#### Scatterplots
- [ ] Household Size vs Duration of Residency

#### Bar Charts
- [ ] Median Income Brackets per Ethnicity
- [ ] Percentage of US-born per Ethnicity
- [ ] Percentage of Religion per Ethnicity
- [ ] Percentage of Discrimination per Ethnicity
- [ ] Percentage of Discrimination per Religion

#### Histograms
- [ ] For all columns 

#### Boxplots 
- [ ] Income vs Age

#### Correlation
- [ ] Visualizing correlation

### Sample Summary Statistics

In [None]:
for col in asian.columns:
    display(asian[col].value_counts())
    print()

In [None]:
asian.describe()

In [None]:
# numeric_only=True skips the string columns (ethnicity, religion, gender)
asian.median(numeric_only=True)

Median household income for Asian Americans living in Austin, Texas:

In [None]:
asian['income'].median()

> The median income bracket for Asian-Americans living in Austin, Texas is \\$50,000 - \\$59,999.

In [None]:
asian['us_born'].value_counts()

> The majority of Asian-Americans living in Austin, TX were not born in the United States. (Nos: 2082 > Yeses: 225)
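
The same comparison can be expressed as shares of the total. A small sketch that rebuilds the two reported counts as a 0/1 column (the names `us_born` and `shares` are hypothetical stand-ins for the notebook's column):

```python
import pandas as pd

# Rebuild the reported counts (2082 "No", 225 "Yes") as a 0/1 column
us_born = pd.Series([0] * 2082 + [1] * 225, name="us_born")

# normalize=True returns shares of the total instead of raw counts
shares = us_born.value_counts(normalize=True)
print(shares.round(3))
```

On these counts, roughly nine in ten respondents with a recorded answer were born outside the United States.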

If performing statistical tests, do the distributions of the variables satisfy the assumptions?
> The chi-square test for independence requires that the observations are independent and that the sample is large enough (expected counts > 5 in each cell).

- Based on prior analysis of the data, we have determined that the number of responses (2307) is sufficient to perform statistical inference.
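
The sample-size condition for the chi-square test is usually checked on the expected count of each cell rather than on the overall number of responses. The sketch below computes expected counts and the Pearson chi-square statistic with NumPy on made-up data; the group labels and the names `toy`, `observed`, `expected`, `assumption_ok`, and `chi2_stat` are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up responses for illustration; not the survey data
toy = pd.DataFrame({
    "ethnicity": ["A"] * 40 + ["B"] * 60,
    "discrimination": [1] * 10 + [0] * 30 + [1] * 30 + [0] * 30,
})

observed = pd.crosstab(toy["ethnicity"], toy["discrimination"]).to_numpy()

# Expected count for each cell under independence:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Rule of thumb: every expected count should be at least 5
assumption_ok = bool((expected >= 5).all())

# Pearson chi-square statistic (no continuity correction)
chi2_stat = float(((observed - expected) ** 2 / expected).sum())
print(assumption_ok, round(chi2_stat, 3))
```

For a p-value, `scipy.stats.chi2_contingency` accepts the same observed table; note that it applies a continuity correction to 2x2 tables by default, so its statistic can differ from the uncorrected value computed here.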

**calculating correlation**

In [None]:
# correlations are only defined for the numeric columns
asian.select_dtypes(include='number').corr()

### Sample Relevant Plots

**Boxplot for age and income**

In [None]:
asian.boxplot('age', by = 'income', figsize = (10,10) )

**scatterplot of household size vs. duration of residency**

In [None]:
plt.scatter(asian['duration_of_residency'], asian['household_size'])
plt.xlabel('Duration of Residency')
plt.ylabel('Household size')

**visualising correlations**

In [None]:
# restrict to numeric columns before computing correlations
sns.heatmap(asian.select_dtypes(include='number').corr(), center=0, cmap='coolwarm' )
plt.show()
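
Because many of the encoded columns are ordinal (income brackets, English ability levels), a rank-based correlation may be a reasonable complement to the default Pearson correlation. A minimal sketch on made-up values, with the frame `toy` and its contents being hypothetical:

```python
import pandas as pd

# Made-up ordinal columns for illustration; not the survey data
toy = pd.DataFrame({
    "income": [1, 2, 3, 5, 8],
    "education_completed": [9, 12, 13, 16, 20],
    "english_difficulties": [4, 4, 3, 2, 1],
})

# Spearman correlation is the Pearson correlation of the ranks
spearman = toy.rank().corr()
print(spearman.round(2))
```

This gives the same result as `toy.corr(method="spearman")` and depends only on the ordering of the values, not on the spacing between bracket codes.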

**Histograms of Variables**

In [None]:
asian.hist(bins = 10, figsize = (12,12))
# plt.tight_layout()
plt.show()

#### Distribution of ethnicities among Asian-Americans living in Austin, Texas

In [None]:
asian['ethnicity'].value_counts().plot(kind='bar');
plt.title("Distribution of Ethnicities among Asian-Americans")
plt.xlabel("Ethnicity")
plt.ylabel("Counts");

In [None]:
asian['income'].value_counts().plot(kind='bar');

#### Distribution of religious groups by ethnicity in Asian-American communities in Austin, Texas

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(x='religion', data=asian, hue = 'ethnicity');
plt.legend(loc='upper right');
plt.xlabel("Religion")
plt.ylabel("Count");
plt.title("Distribution of Religions by Ethnicity");

#### Distribution of US-Born Respondents by ethnicity in Asian-American communities in Austin, Texas

In [None]:
plt.figure(figsize=(8,4))
sns.countplot(x='us_born', data=asian, hue = 'ethnicity', palette = "hls");
plt.legend(loc='upper right');
plt.xlabel("US-Born")
plt.ylabel("Count");
plt.title("Distribution of US-Born Respondents by Ethnicity");
positions = (0.0, 1.0);
labels = ("No", "Yes");
plt.xticks(positions, labels);

#### Median Income Brackets per Ethnicity

In [None]:
plt.figure(figsize=(8,4))
median_income = asian.groupby(['ethnicity'])['income'].median()
sns.barplot(x = median_income.index, y = median_income.values, palette = "Set2");
plt.xlabel("Ethnicity")
plt.ylabel("Median Income Brackets");
positions = (1, 2, 3, 4, 5, 6, 7, 8);
# raw strings with every dollar sign escaped so matplotlib does not treat the labels as math text
labels = (r"\$0 - \$9,999", r"\$10,000 - \$19,999", r"\$20,000 - \$29,999", r"\$30,000 - \$39,999", r"\$40,000 - \$49,999", r"\$50,000 - \$59,999", r"\$60,000 - \$69,999", r"\$70,000 and over");
plt.yticks(positions, labels);
plt.title("Median Income Brackets per Ethnicity");

#### Percentage of Discrimination per Ethnicity

In [None]:
plt.figure(figsize=(8,4));
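discrimination_by_ethnicity = asian.groupby(['ethnicity'])['discrimination'].mean()
# x and y come from the grouped Series, so no data= argument is needed
sns.barplot(x=discrimination_by_ethnicity.index, y=discrimination_by_ethnicity.values);
plt.xlabel("Ethnicity");
plt.ylabel("Proportion Experiencing Discrimination");
plt.title("Proportion Experiencing Discrimination by Ethnicity");
print(discrimination_by_ethnicity)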

This pertains to research question 5 under Happiness. 

> Are certain ethnic groups within AAs more likely to experience discrimination than other ethnic groups within AAs in Texas?

Based on the visualization produced above, 33.1% of Korean-Americans in Austin, Texas have experienced some form of discrimination, followed by 31.8% of Chinese-Americans. Indian-Americans reported the least discrimination, at 16.9%.

#### Proportion of Discrimination per Religion

In [None]:
plt.figure(figsize=(8,4));
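discrimination_by_religion = asian.groupby(['religion'])['discrimination'].mean()
# x and y come from the grouped Series, so no data= argument is needed
sns.barplot(x=discrimination_by_religion.index, y=discrimination_by_religion.values);
plt.xlabel("Religion");
plt.ylabel("Proportion Experiencing Discrimination");
plt.title("Proportion Experiencing Discrimination by Religion");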

This pertains to research question 4 under Happiness.

> Are certain religious groups within AAs more likely to experience discrimination than other religious groups within AAs in Texas?

Based on the visualization produced above, 35.0% of Asian-American Protestants in Austin, Texas have experienced some form of discrimination, followed by 29.9% of Catholics. Hindus reported the least discrimination, at 15.2%.
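
Group proportions are easier to interpret alongside the number of respondents in each group, since a small group can produce an extreme proportion. A sketch on made-up rows, with `toy` and `summary` as hypothetical names:

```python
import pandas as pd

# Made-up rows for illustration; not the survey data
toy = pd.DataFrame({
    "religion": ["Protestant", "Protestant", "Hindu", "Hindu", "Hindu", "Other"],
    "discrimination": [1, 0, 0, 0, 1, 1],
})

# Proportion reporting discrimination and group size, largest proportion first
summary = (
    toy.groupby("religion")["discrimination"]
    .agg(proportion="mean", respondents="count")
    .sort_values("proportion", ascending=False)
)
print(summary)
```

In this toy example the top-ranked group contains a single respondent, which is the kind of case the extra `respondents` column is meant to expose.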

In [None]:
asian.columns

In [None]:
asian['ethnicity'][21]

In [12]:
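# dropna(inplace=True) on a selected column does not modify `asian`, so the NaNs remain;
# asian = asian.dropna(subset=['quality_of_life']) would drop those rows instead
asian['quality_of_life'].dropna(inplace = True)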
asian['quality_of_life']

0       NaN
1       NaN
2       8.0
3       NaN
4       NaN
       ... 
2604    8.0
2605    9.0
2606    6.0
2607    6.0
2608    8.0
Name: quality_of_life, Length: 2609, dtype: float64

In [None]:
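from sklearn.linear_model import LogisticRegression

# logistic_data is assumed to be a numeric dataframe built from `asian`
# with a binary quality_of_life column; it is not defined earlier in this notebook
predictors = list(logistic_data.columns)
predictors.remove('quality_of_life')
# fit logit model on all remaining columns
target_model = LogisticRegression().fit(
    logistic_data[predictors].values,
    logistic_data['quality_of_life']
)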

In [11]:
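# binarize quality of life: scores of 5 or below become 0, scores above 5 become 1
# (reassigning the loop variable would not change the column, so the comparison is vectorized;
# NaN > 5 evaluates to False, so missing scores should be filled or dropped first)
asian['quality_of_life'] = (asian['quality_of_life'] > 5).astype(int)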
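
The thresholding step above interacts with missing values, because a comparison against NaN evaluates to False. A minimal sketch of one way to handle this, assuming missing scores should be excluded rather than imputed; the series `scores` and the names `answered` and `high_quality` are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up scores; NaN simulates a skipped question
scores = pd.Series([8.0, np.nan, 5.0, 6.0, 3.0], name="quality_of_life")

# Drop missing scores first so they are not silently counted as "low"
answered = scores.dropna()

# 1 for scores above 5, 0 for scores of 5 or below
high_quality = (answered > 5).astype(int)
print(high_quality.tolist())
```

Whether to drop or impute missing scores is a modeling choice; imputing with the median of 5.0, as in the cleaning table, would place every imputed respondent in the 0 class under this threshold.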






In [13]:
asian['regular_exercise'].value_counts()

1.0    1604
0.0     992
Name: regular_exercise, dtype: int64

## Introduction

According to the 2020 US Census, approximately 19.9 million people identify as Asian. We have collected publicly available data with statistics on Asian American quality of life in 2018.






---

## Questions for Reviewers

- Will we be penalized if certain hypothesis tests that we need to perform can only be applied to quantitative data?
- What is the ideal number of small research questions?