# Tobacco Use in the US, 1995-2010

### <u>Problem 3(a)</u>

#### Description of dataset
My dataset contains information on tobacco use in the US by state or territory (hereafter, 'jurisdiction'), during the period 1995-2010. For a given year and jurisdiction, a number is given for each of the following four criteria, representing a percentage of the jurisdiction's total population for that year: (i) smoke everyday; (ii) smoke some days; (iii) former smoker; and (iv) never smoked.
#### URL of dataset and download instructions
The dataset's URL is [https://data.cdc.gov/Smoking-Tobacco-Use/BRFSS-Prevalence-and-Trends-Data-Tobacco-Use-Four-/8zak-ewtm]. For some reason, the link breaks, but the dataset can easily be found by searching for "tobacco use" in the search tool in the upper right-hand corner of the linked page. From the dataset's main page, it can be downloaded by clicking the Export button in the top right-hand corner and then clicking CSV. I have downloaded it as the file US-Smoking-data_1995-2010.csv and stored it in ./data.
#### Two interesting questions
1. How have the percentages for the four types of tobacco use fluctuated for Washington State from 1995-2010?
2. How have the percentages of people who smoke every day fluctuated for California, Texas, Florida, and New York (the four most populous states as of 2021)?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### <u>Problem 3(b)</u>

#### Loading the dataset to a pandas dataframe

In [None]:
df = pd.read_csv('data/US-Smoking-data_1995-2010.csv')
df
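
Since the four categories are described as percentages of the same population, one quick sanity check after loading is to confirm that they add up to roughly 100 in every row. The sketch below runs on a toy frame with made-up values (not taken from the CDC file); the helper name `rows_not_summing_to_100` and the tolerance of 0.5 are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows with made-up values; NOT taken from the CDC dataset.
toy = pd.DataFrame({
    'Year': [1995, 1995],
    'State': ['A', 'B'],
    'Smoke everyday': [20.0, 18.0],
    'Smoke some days': [5.0, 4.0],
    'Former smoker': [25.0, 23.0],
    'Never smoked': [50.0, 50.0],
})

usage_category = ['Smoke everyday', 'Smoke some days', 'Former smoker', 'Never smoked']

def rows_not_summing_to_100(frame, columns, tol=0.5):
    """Return the rows whose category percentages do not add up to ~100.

    Hypothetical helper; `tol` is an arbitrary allowance for rounding.
    """
    totals = frame[columns].sum(axis=1)
    return frame[(totals - 100).abs() > tol]

bad_rows = rows_not_summing_to_100(toy, usage_category)
print(bad_rows[['Year', 'State']])
```

On the real data, the same call would be `rows_not_summing_to_100(df, usage_category)`, assuming the four columns were parsed as numbers.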

### <u>Problem 3(c)</u>

#### Using df.describe() to get a rough overview of the data

In [None]:
df.describe()

### <u>Problem 3(d)</u>

#### (I) Print the array of columns and the index array

In [None]:
print(df.columns)
print(df.index)

#### (II) Simple plot of part of the data

The plot below answers the first question I asked above, by showing how the percentages for the four types of tobacco use fluctuated for Washington State from 1995-2010.

In [None]:
# Select the Washington rows and order them chronologically
d1 = df[df['State'] == 'Washington'].sort_values(by='Year')
usage_category = ['Smoke everyday', 'Smoke some days', 'Former smoker', 'Never smoked']
plt.figure(figsize=(15, 10))
# Draw one line per tobacco-use category
for category in usage_category:
    plt.plot(d1['Year'], d1[category], label=category)
plt.legend()
plt.grid()
plt.xlabel('year')
plt.ylabel('percentage of population')
plt.title('Washington Tobacco Usage, 1995-2010', fontsize=25)

#### (III) Create a Pivot Table

The pivot table's index consists of the values of the original table's Year column, while its columns are the values of the original table's State column. Each entry of the pivot table is the value of the Smoke everyday column for the corresponding year and state. For example, the entry for the 1996 index and Washington column is the percentage of people that smoked every day in Washington in 1996.

In [None]:
df2 = pd.pivot_table(df, values='Smoke everyday', index=['Year'], columns=['State'])
df2
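
The layout described above can be seen on a small scale with a toy frame. The values below are made up for illustration and are not from the dataset. Note that `pd.pivot_table` aggregates with the mean by default, so if a (Year, State) pair appeared more than once, its entry would be the average of the duplicates.

```python
import pandas as pd

# Toy long-format table with made-up values; NOT taken from the CDC dataset.
toy = pd.DataFrame({
    'Year': [1995, 1995, 1996, 1996],
    'State': ['Washington', 'California', 'Washington', 'California'],
    'Smoke everyday': [20.0, 15.0, 19.0, 14.0],
})

# Years become the index, states become the columns.
toy_pivot = pd.pivot_table(toy, values='Smoke everyday', index='Year', columns='State')
print(toy_pivot)

# Look up a single entry by (year, state).
wa_1996 = toy_pivot.loc[1996, 'Washington']
print(wa_1996)
```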

#### (IV) Plot some data from the pivot table

Using the pivot table we just created, we will answer our second question. That is, we will plot the fluctuations of the percentages of people who smoked every day for California, Texas, Florida, and New York from 1995-2010.

In [None]:
df2[['California','Texas','Florida','New York']].plot(figsize=(15,10))
plt.grid()
plt.ylabel('percentage of people that smoked every day')
plt.title('Smoked Every Day, 1995-2010',fontsize=25)

#### (V) Use the groupby feature

Next, we will use pandas' groupby feature to answer the following question:
<ul>
    <li>How did the national average and median of the never smoked category of tobacco usage fluctuate from 1995-2010?</li>
</ul>

In [None]:
# Group by year, then take the mean and median of 'Never smoked' across jurisdictions
df.groupby('Year')['Never smoked'].agg(['mean', 'median']).plot(figsize=(15,10))
plt.ylabel('percentage of people that never smoked')
plt.title("National Percentages of People that Never Smoked, 1995-2010",fontsize=25)
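
To make the aggregation step concrete, here is a minimal sketch on toy values (made up, not from the dataset). It also shows one caveat: the mean is taken over jurisdictions, each counted equally, so it is an average of jurisdiction percentages and not a population-weighted national figure.

```python
import pandas as pd

# Toy data: three made-up jurisdictions per year; NOT taken from the CDC dataset.
toy = pd.DataFrame({
    'Year': [1995, 1995, 1995, 1996, 1996, 1996],
    'Never smoked': [50.0, 52.0, 60.0, 51.0, 53.0, 64.0],
})

# One row per year, one column per statistic.
summary = toy.groupby('Year')['Never smoked'].agg(['mean', 'median'])
print(summary)
```

In both toy years a single high value pulls the mean above the median, which is the kind of gap between the two lines that the plot can reveal.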

### <u>Problem 3(e)</u>

### Discussion

I will comment on each of the three plots generated.
<ol>
    <li><b>Washington Tobacco Usage, 1995-2010:</b> The data shows, especially when compared to the final plot, that the percentage of people that never smoked in Washington rose from 1995-2010 largely in line with the national average. However, the data also shows that the percentage of people that smoke some days remained steady, and even rose in some years, despite the percentage of people that smoke every day falling dramatically. Perhaps this can be attributed to people who smoked every day trying to quit or at least reduce their tobacco usage.</li>
    <li><b>Smoked Every Day, 1995-2010:</b> The data shows, at least with respect to the four states that were plotted, that the percentage of people that smoked every day in California started from a significantly lower point than in the other three states. Regardless, the rate of decline of every-day smokers in California appears to be similar to that of the other three states.</li>
    <li><b>National Percentages of People that Never Smoked, 1995-2010:</b> The median is slightly lower than the average, which suggests that a few jurisdictions have unusually high percentages of people that never smoked, pulling the average above the median. Further, there is a dramatic spike in 2003. Perhaps this suggests that a population came of age in 2003 that </li>
    
</ol>
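
The remark in the second item about similar rates of decline is based on eyeballing the plot. One way to check it would be to fit a straight line to each state's series and compare the slopes. The sketch below does this on a toy pivot table with made-up values; the helper `yearly_slopes` is hypothetical, and it assumes there are no missing values in the columns being fitted.

```python
import numpy as np
import pandas as pd

# Toy pivot table (Year index, one column per state) with made-up values;
# NOT taken from the CDC dataset.
toy_pivot = pd.DataFrame(
    {'State X': [20.0, 19.0, 18.0, 17.0], 'State Y': [15.0, 14.0, 13.0, 12.0]},
    index=pd.Index([1995, 1996, 1997, 1998], name='Year'),
)

def yearly_slopes(pivot):
    """Least-squares slope (percentage points per year) for each column.

    Hypothetical helper; assumes no NaN values in `pivot`.
    """
    years = pivot.index.to_numpy(dtype=float)
    # Centering the years does not change the slope but keeps the fit well conditioned.
    centered = years - years.mean()
    return pd.Series(
        {state: np.polyfit(centered, pivot[state].to_numpy(), 1)[0] for state in pivot.columns}
    )

slopes = yearly_slopes(toy_pivot)
print(slopes)
```

With the notebook's own pivot table, the equivalent call would be `yearly_slopes(df2[['California', 'Texas', 'Florida', 'New York']])`. Similar slopes would support the observation, while a different starting level would show up only in the intercepts.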