# COMP 4151 - Project Report 1


## Group members: Jie Ni, Michael Porter, Bennett Poorman

**Note**: 

- All work in this assignment must be done, created, and originated by each student. 
- All external assistance must be explicitly mentioned and cited. 
- Using work done by others without explicit citation is considered cheating. 
- A student will receive a zero grade on the assignment for cheating. 
- Repeated offences will lead to additional consequences.

This assignment focuses on the first phase of the project.  The dataset can be obtained here: https://umdrive.memphis.edu/vphan/public/4151/BCHI-dataset_2019-03-04.csv

Give answers to these questions that are as comprehensive as possible. Each answer should be written in English and, where applicable, should include the Python code that shows how you obtained it. People should be able to read and understand your answer without having to guess how you got it.

1. What is the Indicator attribute?
2. How many categories of Indicator are there?
3. Explain the "Value" value of row 26382 in this dataset.
4. Explain the "Value" value of row 7833.
5. Explain the "Value" value of row 10682.  What does it mean that the "Sex" value is "Both"?
6. Explain the "Value" value of row 26701.
7. Specifically, which factors does the indicator category 'Social and Economic Factors' consist of?
8. Visualize (e.g. with seaborn) the suicide rate of some specific race in the 3 most populous and 3 least populous cities over the period of the seven years.  Explain in English each step, and show your Python code of each step.


Assume **data** = **pandas.read_csv("https://umdrive.memphis.edu/vphan/public/4151/BCHI-dataset_2019-03-04.csv")**, with the URL passed as a string.


### 1. What is the Indicator attribute?

+ The Indicator attribute names the specific measure (for example, a mortality rate or a percentage of the population) that the "Value" column reports for a given data point.

### 2. How many categories of Indicator are there?


> The following Python code returns the values of the 'Indicator Category' and 'Indicator' columns, respectively.

```python

data['Indicator Category']

data['Indicator']

```

> Using the Python code below, we find that there are 13 unique 'Indicator Category' values and 61 unique 'Indicator' values. From this we also see that each 'Indicator Category' groups one or more 'Indicator' values.

```python
data['Indicator Category'].nunique() # output -> 13
data['Indicator'].nunique() # output -> 61
```
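
The relationship between the two columns can also be checked directly by counting distinct indicators within each category. The following is a minimal sketch on a made-up frame: the `toy` rows and the names "Category A", "Indicator A1", and so on are placeholders, not values from the dataset. Only the column names are the ones used above.

```python
import pandas as pd

# Placeholder rows for illustration; these are NOT values from the BCHI dataset.
toy = pd.DataFrame({
    "Indicator Category": ["Category A", "Category A", "Category A", "Category B"],
    "Indicator": ["Indicator A1", "Indicator A2", "Indicator A1", "Indicator B1"],
})

# Number of distinct indicators inside each category.
indicators_per_category = toy.groupby("Indicator Category")["Indicator"].nunique()

# Number of distinct categories each indicator appears in.
categories_per_indicator = toy.groupby("Indicator")["Indicator Category"].nunique()

# True when every indicator belongs to exactly one category.
each_indicator_in_one_category = bool((categories_per_indicator == 1).all())
```

Running the same two `groupby` calls on `data` instead of `toy` would show how the 61 indicators are spread over the 13 categories.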


### 3. Explain the "Value" value of row 26382 in this dataset.


```python
data.iloc[[26382]] # returns the row at position 26382 (zero-based) as a DataFrame
# As explained in the "BCHC Requested Methodology" column, the value is the rate of infant deaths per 1,000 live births.

data.iloc[[26382],5] # position 5 is taken to be the "Value" column
# Taking the columns 'Year', 'Race/Ethnicity', 'Value', 'Place', and 'BCHC Requested Methodology' together, this record says: in 2010 in Denver, CO, the infant death rate for the Asian population was about 2 per 1,000 live births (2/1000).
```
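
Selecting the value by column position (`5`) works but is easy to misread. As an alternative, here is a small sketch that looks up a row by position and picks columns by name. The helper `summarize_row` and the `toy` frame are hypothetical and only illustrate the idea; the toy rows are placeholders, not dataset records.

```python
import pandas as pd

def summarize_row(df, position, columns):
    """Return the chosen columns of the row at zero-based `position` as a dict."""
    row = df.iloc[position]  # positional lookup, like data.iloc[[position]]
    return {col: row[col] for col in columns}

# Placeholder rows for illustration only.
toy = pd.DataFrame({
    "Year": [2010, 2011],
    "Place": ["City A", "City B"],
    "Value": [2.0, 44.2],
})

summary = summarize_row(toy, 1, ["Year", "Place", "Value"])
```

With the real data, a call such as `summarize_row(data, 26382, ['Year', 'Place', 'Value'])` would gather the fields needed to interpret a record without relying on column order.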

### 4. Explain the "Value" value of row 7833.

```python
data.iloc[[7833]] # returns the row at position 7833 (zero-based) as a DataFrame
# As explained in the "BCHC Requested Methodology" column, the value is the percentage of adults who meet "CDC-Recommended Physical Activity Levels".

data.iloc[[7833],5]
# Taking the columns 'Year', 'Race/Ethnicity', 'Value', 'Place', and 'BCHC Requested Methodology' together, this record says: in 2011 in Oakland (Alameda County), CA, about 44.2% of Hispanic adults met the CDC-recommended physical activity levels.
```

### 5. Explain the "Value" value of row 10682. What does it mean that the "Sex" value is "Both"?

```python
data.iloc[[10682]] # returns the row at position 10682 (zero-based) as a DataFrame
# As explained in the "BCHC Requested Methodology" column, the value is the percentage of the population with a disability "(including hearing, vision, cognitive, ambulatory, self-care, or independent living difficulties)".

data.iloc[[10682],5]
# Taking the columns 'Year', 'Race/Ethnicity', 'Value', 'Place', and 'BCHC Requested Methodology' together, this record says: in 2016 in Los Angeles, CA, about 8.9% of the Hispanic population had a disability.
```

> The value "Both" in the "Sex" column means that the figure covers both sexes (male and female) combined, rather than either one separately.
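
One way to confirm how the "Sex" column is coded is to list its distinct values and count them. The sketch below uses a made-up `toy` frame; the values "Female" and "Male" are assumed labels for the separate sexes, and the numbers are placeholders rather than dataset values.

```python
import pandas as pd

# Placeholder rows; "Female" and "Male" are assumed labels, not confirmed dataset values.
toy = pd.DataFrame({
    "Place": ["City A", "City A", "City A", "City B"],
    "Sex": ["Both", "Female", "Male", "Both"],
    "Value": [8.9, 9.4, 8.3, 7.1],
})

# How often each coding of "Sex" appears.
sex_counts = toy["Sex"].value_counts()

# Rows where the value describes both sexes combined.
both_only = toy[toy["Sex"] == "Both"]
```

Applied to `data`, `data['Sex'].value_counts()` would show whether "Both" appears alongside separate rows for each sex.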

### 6. Explain the "Value" value of row 26701.

```python
data.iloc[[26701]] # returns the row at position 26701 (zero-based) as a DataFrame
# The indicator is the same as in question #3: the rate of infant deaths per 1,000 live births.

data.iloc[[26701],5]
# Taking the columns 'Year', 'Race/Ethnicity', 'Value', 'Place', and 'BCHC Requested Methodology' together, this record says: in 2012 in San Antonio, TX, the infant death rate for the White population was about 5 per 1,000 live births (5/1000).
```

### 7. Specifically, which factors does the indicator category 'Social and Economic Factors' consist of?

```python
data.loc[data['Indicator Category'] == "Social and Economic Factors", 'Indicator'].unique()

# Below is the output

# [
#        'Median Household Income (Dollars)',
#        'Percent Living Below 200% Poverty Level',
#        'Percent of 3 and 4 Year Olds Currently Enrolled in Preschool',
#        'Percent of Children Living in Poverty',
#        'Percent of High School Graduates (Over Age 18)',
#        'Percent of Households Whose Housing Costs Exceed 35% of Income',
#        'Percent of Population Uninsured', 'Percent Unemployed'
# ]

```

### 8. Visualize (e.g. with seaborn) the suicide rate of some specific race in the 3 most populous and 3 least populous cities over the period of the seven years. Explain in English each step, and show your Python code of each step.

```python
import pandas as pd  # `data` is assumed to be loaded as described at the top of the report

# Step 1: keep the population rows, dropping the nationwide aggregate.
exclude_place = 'U.S. Total, U.S. Total'
population_data = data[(data['Indicator'] == 'Total Population (People)') & (data['Place'] != exclude_place)]

# Step 2: keep the suicide-rate rows, again without the nationwide aggregate.
suicide_data = data[(data['Indicator'] == "Suicide Rate (Age-Adjusted; Per 100,000 people)") & (data['Place'] != exclude_place)]

# Step 3: merge the two frames (population and suicide) on year, sex, race/ethnicity, and place.
# Only rows present in both frames survive the merge, so years with missing data for either
# indicator drop out; this leaves the period 2012 to 2016.
merged = pd.merge(population_data, suicide_data, on=['Year','Sex','Race/Ethnicity','Place'], suffixes=("_pop","_suicide"))

# Step 4: sort by year and population (both ascending) and take the first 3 rows of each year.
# This gives the 3 least populous cities per year.
least_populous_data = merged.sort_values(by=['Year','Value_pop'], ascending=[True,True]).groupby("Year").head(3)

# Step 5: similarly, sort population in descending order to get the 3 most populous cities per year.
most_populous_data = merged.sort_values(by=['Year','Value_pop'], ascending=[True,False]).groupby("Year").head(3)


import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="whitegrid")

# Step 6: one subplot per year for the least populous cities.
# Within each subplot the cities are ordered by ascending population (left to right: less to more populous).
years = least_populous_data['Year'].unique()
fig, axes = plt.subplots(len(years), figsize=(10,10), squeeze=False)  # one row per year instead of a hard-coded 5
for ax, year in zip(axes[:, 0], years):
    sns.barplot(x='Place', y='Value_suicide', data=least_populous_data[least_populous_data['Year']==year], ax=ax)

# Step 7: the same layout for the most populous cities.
# Within each subplot the cities are ordered by descending population (left to right: more to less populous).
years = most_populous_data['Year'].unique()
fig, axes = plt.subplots(len(years), figsize=(10,10), squeeze=False)
for ax, year in zip(axes[:, 0], years):
    sns.barplot(x='Place', y='Value_suicide', data=most_populous_data[most_populous_data['Year']==year], ax=ax)
```
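
The question asks about one specific race, whereas the code above keeps every race and sex combination that survives the merge. A possible variant, sketched below under stated assumptions, first fixes a set of cities by their average population and then filters the suicide rows to a single race with "Sex" equal to "Both". The helpers `pick_extreme_cities` and `suicide_rates_for` are hypothetical, the toy frames contain placeholder numbers, and choosing cities by mean population across years is a different selection rule from the per-year ranking used above.

```python
import pandas as pd

def pick_extreme_cities(pop, n=3):
    """Return (least, most) populous place names, ranked by mean population across years."""
    mean_pop = pop.groupby("Place")["Value"].mean().sort_values()
    least = list(mean_pop.index[:n])
    most = list(mean_pop.index[-n:][::-1])  # largest first
    return least, most

def suicide_rates_for(suicide, places, race, sex="Both"):
    """Keep the rows for the given places, one race, and one sex coding."""
    mask = (suicide["Place"].isin(places)
            & (suicide["Race/Ethnicity"] == race)
            & (suicide["Sex"] == sex))
    return suicide.loc[mask, ["Year", "Place", "Value"]].sort_values(["Place", "Year"])

# Placeholder population rows (not dataset values).
pop = pd.DataFrame({
    "Year": [2012, 2013, 2012, 2013, 2012, 2013],
    "Place": ["City A", "City A", "City B", "City B", "City C", "City C"],
    "Value": [100, 110, 500, 520, 50, 60],
})

# Placeholder suicide-rate rows (not dataset values).
suicide = pd.DataFrame({
    "Year": [2012, 2013, 2012, 2012, 2012, 2012],
    "Place": ["City B", "City B", "City B", "City C", "City C", "City A"],
    "Race/Ethnicity": ["White", "White", "Black", "White", "White", "White"],
    "Sex": ["Both", "Both", "Both", "Both", "Male", "Both"],
    "Value": [10.0, 11.0, 5.0, 7.0, 12.0, 9.0],
})

least, most = pick_extreme_cities(pop, n=1)
rates = suicide_rates_for(suicide, least + most, race="White")
```

The resulting long-format frame has one row per city and year, so it could be passed to something like `sns.lineplot(x="Year", y="Value", hue="Place", data=rates)` to show each city's trend over the period as a single line.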
