
# Smoking Trends Analysis

## Introduction
This project explores global smoking patterns using a multi-year dataset that includes smoking percentages by gender, daily cigarette consumption, and smoker population per country.

## Objectives
- Identify countries with the highest smoking rates
- Compare male and female smoking behaviors
- Explore trends over time and relationships between smoking rate and consumption
- Visualize the data effectively using interactive charts

## Dataset
Source: [CORGIS Smoking Dataset](https://corgis-edu.github.io/corgis/csv/smoking/)


# Smoking Data Visualization Project
This notebook analyzes smoking trends using a global dataset.

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("smoking.csv")

df = df.rename(columns={
    'Data.Daily cigarettes': 'Daily Cigarettes',
    'Data.Percentage.Male': 'Male Smoking %',
    'Data.Percentage.Female': 'Female Smoking %',
    'Data.Percentage.Total': 'Total Smoking %',
    'Data.Smokers.Total': 'Total Smokers',
    'Data.Smokers.Male': 'Male Smokers',
    'Data.Smokers.Female': 'Female Smokers'
})

In [None]:
fig = px.line(df, x='Year', y='Total Smoking %', color='Country',
              title='Total Smoking Percentage Over Time by Country')
fig.show()

In [None]:
latest_year = df['Year'].max()
df_latest = df[df['Year'] == latest_year]

fig = go.Figure()
fig.add_trace(go.Bar(x=df_latest['Country'], y=df_latest['Male Smoking %'],
                     name='Male', marker_color='blue'))
fig.add_trace(go.Bar(x=df_latest['Country'], y=df_latest['Female Smoking %'],
                     name='Female', marker_color='pink'))

fig.update_layout(barmode='group', title='Smoking Percentage by Gender (Latest Year)',
                  xaxis_title='Country', yaxis_title='Smoking %', xaxis_tickangle=45)
fig.show()

In [None]:
pivot = df.pivot_table(index='Country', columns='Year', values='Total Smoking %')

plt.figure(figsize=(16, 10))
sns.heatmap(pivot, annot=False, cmap='YlGnBu')
plt.title('Heatmap of Smoking Percentage by Country and Year')
plt.xlabel('Year')
plt.ylabel('Country')
plt.tight_layout()
plt.show()

In [None]:
# Reuse the latest-year slice from the earlier cell and keep the 10 highest rates
top_smoking = df_latest.nlargest(10, 'Total Smoking %')

fig = px.bar(top_smoking, x='Country', y='Total Smoking %',
             title='Top 10 Countries with Highest Smoking Rates (Latest Year)',
             color='Total Smoking %', color_continuous_scale='Reds')
fig.show()

In [None]:
selected_countries = ['Indonesia', 'United States', 'China', 'India']
df_filtered = df[df['Country'].isin(selected_countries)]

fig = px.line(df_filtered, x='Year', y='Total Smoking %', color='Country',
              title='Smoking Trends in Selected Countries')
fig.show()

In [None]:
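# trendline='ols' requires the statsmodels package; trendline_scope='overall'
# fits one regression line across all countries instead of one per country
fig = px.scatter(df, x='Daily Cigarettes', y='Total Smoking %',
                 trendline='ols', trendline_scope='overall', color='Country',
                 title='Correlation Between Daily Cigarettes and Smoking Rate')
fig.show()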

In [None]:
fig = px.box(df_latest, y=['Male Smoking %', 'Female Smoking %'], 
             title='Gender-based Smoking Percentage Distribution (Latest Year)')
fig.show()

In [None]:
# Sort by year so the animation frames play in chronological order
fig = px.choropleth(df.sort_values('Year'), locations='Country', locationmode='country names',
                    color='Total Smoking %', hover_name='Country',
                    animation_frame='Year', title='Global Smoking Trends Over Time',
                    color_continuous_scale='OrRd')
fig.show()

In [None]:

import ipywidgets as widgets

# Sorted country names make the dropdown easier to scan
country_selector = widgets.Dropdown(options=sorted(df['Country'].unique()), description='Country:')

def plot_country(country):
    country_df = df[df['Country'] == country]
    fig = px.line(country_df, x='Year', y='Total Smoking %', title=f'Smoking Trend in {country}')
    fig.show()

# interact() renders the dropdown itself, so a separate display() call would show it twice
widgets.interact(plot_country, country=country_selector);



## ❗ Why PySpark, Kafka, or Hadoop Were Not Used

Although tools like **PySpark**, **Kafka**, and **Hadoop** were introduced in class for handling big data workflows, they were not necessary for this project because:

- 📊 The dataset is **small** (CSV format, a few thousand rows) and can be efficiently processed using **Pandas**.
- ⚡ No **real-time streaming** is involved, so **Kafka** is not applicable.
- 🌐 The data does not require **distributed computing** or storage, so **Hadoop** is not relevant.
- 🧪 The primary goal is **exploratory data visualization**, which is better served by tools like **Jupyter**, **Plotly**, and **Seaborn** for their interactivity and plotting capabilities.

If the dataset were larger or involved real-time updates, those big data tools would be more appropriate.
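
The "small dataset" claim above can be checked directly rather than assumed. The following is a minimal sketch: `size_report` is a hypothetical helper, not part of the notebook, and the `toy` frame is a made-up stand-in so the snippet runs on its own. In the notebook, the loaded `df` would be passed instead.

```python
import pandas as pd

def size_report(frame: pd.DataFrame) -> dict:
    """Return row count, column count, and in-memory size in megabytes."""
    n_rows, n_cols = frame.shape
    # deep=True counts the bytes actually held by string (object) columns
    mem_mb = float(frame.memory_usage(deep=True).sum()) / 1024 ** 2
    return {"rows": n_rows, "columns": n_cols, "memory_mb": round(mem_mb, 3)}

# Toy stand-in with invented values (assumption: not real smoking data)
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
    "Total Smoking %": [30.0, 29.5, 12.0, 11.8],
})

report = size_report(toy)
print(report)
```

If the reported memory footprint fits comfortably in RAM, single-machine Pandas is a reasonable choice; a frame that approaches or exceeds available memory would be the signal to consider distributed tools.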



## Conclusion

- Countries such as Indonesia and China have consistently high smoking rates.
- Male smoking rates are higher than female rates in most countries.
- The scatter plot suggests a positive correlation between daily cigarette consumption and overall smoking prevalence.
- Smoking rates have declined in some regions, which may reflect public health progress.

Further research could incorporate policy data or regional economic indicators.
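
The gender-gap and correlation statements in the conclusions can be backed by numbers as well as charts. The sketch below assumes the renamed column names used in this notebook; `gender_gap_share` and `consumption_rate_corr` are hypothetical helpers, and the `toy` frame holds invented values only so the snippet runs standalone.

```python
import pandas as pd

def gender_gap_share(frame: pd.DataFrame) -> float:
    """Fraction of rows where the male rate exceeds the female rate."""
    return float((frame["Male Smoking %"] > frame["Female Smoking %"]).mean())

def consumption_rate_corr(frame: pd.DataFrame) -> float:
    """Pearson correlation between daily cigarettes and total smoking rate."""
    return float(frame["Daily Cigarettes"].corr(frame["Total Smoking %"]))

# Toy rows with invented values (assumption: not real smoking data)
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Male Smoking %": [40.0, 25.0, 10.0, 30.0],
    "Female Smoking %": [5.0, 20.0, 12.0, 8.0],
    "Total Smoking %": [22.0, 22.5, 11.0, 19.0],
    "Daily Cigarettes": [18.0, 20.0, 9.0, 15.0],
})

share = gender_gap_share(toy)
corr = consumption_rate_corr(toy)
print(f"male > female in {share:.0%} of rows; Pearson r = {corr:.2f}")
```

In the notebook, calling `gender_gap_share(df_latest)` would give the share of countries with a higher male rate in the latest year, and `consumption_rate_corr(df)` would give the correlation across all country-year rows. A correlation coefficient describes association only and does not show that one variable drives the other.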
