# EDA Report on A34(1) Refusal Data
This notebook performs exploratory data analysis (EDA) on `a34_1_refused_tidy.csv`. It covers data loading, cleaning, descriptive statistics, and interactive Plotly visualizations, including trends over time.


## 1. Data Loading and Overview
Load the CSV file and display basic information.

In [1]:
import pandas as pd

# Load the data
df = pd.read_csv('../data/processed/a34_1_refused_tidy.csv')

# Display shape and first rows
print(f"Data shape: {df.shape}")
df.head()

Data shape: (3486, 6)


| | inadmissibility_grounds | country | year | cor_status | resident | count |
|---|---|---|---|---|---|---|
| 0 | A34(1) | Afghanistan | 2019 | COR Not Canada | Permanent Resident | 1 |
| 1 | A34(1) | Argentina | 2019 | COR Not Canada | Permanent Resident | 0 |
| 2 | A34(1) | Egypt | 2019 | COR Not Canada | Permanent Resident | 1 |
| 3 | A34(1) | Eritrea | 2019 | COR Not Canada | Permanent Resident | 0 |
| 4 | A34(1) | Haiti | 2019 | COR Not Canada | Permanent Resident | 0 |


## 2. Data Cleaning and Preprocessing
Check for missing values, then aggregate counts by country, year, and resident status (summing over `cor_status`).

In [2]:
# Check for missing values
print(df.isnull().sum())

# Summarize by country, year, and resident status
df_summary = df.groupby(['country', 'year', 'resident'])['count'].sum().reset_index()
df_summary.head()

inadmissibility_grounds    0
country                    0
year                       0
cor_status                 0
resident                   0
count                      0
dtype: int64


| | country | year | resident | count |
|---|---|---|---|---|
| 0 | Afghanistan | 2019 | Permanent Resident | 1 |
| 1 | Afghanistan | 2019 | Temporary Resident | 0 |
| 2 | Afghanistan | 2020 | Permanent Resident | 0 |
| 3 | Afghanistan | 2020 | Temporary Resident | 0 |
| 4 | Afghanistan | 2021 | Permanent Resident | 0 |
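
Because the `groupby` in this section sums out `cor_status`, and pandas silently drops rows whose group keys are missing, it can be worth confirming that the grand total is unchanged after aggregation. The sketch below is illustrative only: it uses the five rows printed in Section 1, retyped by hand, and the names `sample` and `sample_summary` are assumptions that do not appear in the original notebook.

```python
import pandas as pd

# The five rows shown in the Section 1 `df.head()` output, retyped by hand.
sample = pd.DataFrame({
    "inadmissibility_grounds": ["A34(1)"] * 5,
    "country": ["Afghanistan", "Argentina", "Egypt", "Eritrea", "Haiti"],
    "year": [2019] * 5,
    "cor_status": ["COR Not Canada"] * 5,
    "resident": ["Permanent Resident"] * 5,
    "count": [1, 0, 1, 0, 0],
})

# Same aggregation as `df_summary`: `cor_status` is summed out.
sample_summary = (
    sample.groupby(["country", "year", "resident"])["count"]
    .sum()
    .reset_index()
)

# The grand total should survive the aggregation unchanged.
assert sample_summary["count"].sum() == sample["count"].sum()
print(sample_summary)
```

On the full data the equivalent check would be `df_summary['count'].sum() == df['count'].sum()`.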


## 3. Descriptive Statistics
### 3.1 Total refusals by country

In [3]:
# Total refusals by country
country_totals = df_summary.groupby('country')['count'].sum().sort_values(ascending=False)
country_totals.head(10)

country
Ukraine                       176
Syria                         101
Iran                           70
Bangladesh                     62
Eritrea                        42
Ethiopia                       41
People's Republic of China     41
Russia                         36
India                          32
Sri Lanka                      30
Name: count, dtype: int64
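
To gauge how concentrated refusals are, each country's share of the overall total can be computed alongside a running cumulative share. The sketch below hard-codes the ten values printed above and takes the grand total of 920 from the yearly totals in Section 3.2; the names `top10_counts`, `grand_total`, and `concentration` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Top-10 totals copied from the output above.
top10_counts = pd.Series(
    {
        "Ukraine": 176,
        "Syria": 101,
        "Iran": 70,
        "Bangladesh": 62,
        "Eritrea": 42,
        "Ethiopia": 41,
        "People's Republic of China": 41,
        "Russia": 36,
        "India": 32,
        "Sri Lanka": 30,
    },
    name="count",
)

# Assumption: overall total taken from the yearly totals in Section 3.2.
grand_total = 920

concentration = pd.DataFrame({
    "count": top10_counts,
    "share_pct": (top10_counts / grand_total * 100).round(1),
    "cumulative_pct": (top10_counts.cumsum() / grand_total * 100).round(1),
})
print(concentration)
```

In the notebook itself, `country_totals.head(10)` and `country_totals.sum()` could stand in for the hard-coded values.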

### 3.2 Total refusals by year

In [4]:
# Total refusals by year
year_totals = df_summary.groupby('year')['count'].sum().sort_index()
year_totals

year
2019    221
2020     65
2021     75
2022    129
2023    127
2024    303
Name: count, dtype: int64
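
The yearly totals fall sharply in 2020 and rise again by 2024. One way to quantify this is the year-over-year change, sketched below with the six values printed above hard-coded. The names `yearly` and `yoy` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Yearly totals copied from the output above.
yearly = pd.Series(
    {2019: 221, 2020: 65, 2021: 75, 2022: 129, 2023: 127, 2024: 303},
    name="count",
)
yearly.index.name = "year"

yoy = pd.DataFrame({
    "count": yearly,
    "abs_change": yearly.diff(),                         # difference from the previous year
    "pct_change": (yearly.pct_change() * 100).round(1),  # percent change from the previous year
})
print(yoy)
```

The first row has no previous year, so both change columns are `NaN` for 2019. In the notebook, `year_totals` could be used in place of `yearly`.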

### 3.3 Total refusals by resident status

In [5]:
# Total refusals by resident status
resident_totals = df_summary.groupby('resident')['count'].sum()
resident_totals

resident
Permanent Resident    522
Temporary Resident    398
Name: count, dtype: int64
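
Expressing the two totals as percentages makes the split easier to read. The sketch below hard-codes the values printed above; `resident_counts` and `resident_share` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Totals copied from the output above.
resident_counts = pd.Series(
    {"Permanent Resident": 522, "Temporary Resident": 398},
    name="count",
)

# Share of all refusals, in percent, rounded to one decimal place.
resident_share = (resident_counts / resident_counts.sum() * 100).round(1)
print(resident_share)
```

In the notebook, the same expression applied to `resident_totals` should give the same shares.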

## 4. Visualizations with Plotly
Use Plotly Express for interactive charts.

In [7]:
import plotly.express as px

# Top 10 countries bar chart
top10 = country_totals.head(10).reset_index()
fig1 = px.bar(top10.sort_values('count', ascending=True), x='count', y='country', orientation='h',
              title='Top 10 Countries by Total Refusals (2019-2024)',
              labels={'count':'Total Refusals','country':'Country'})
fig1.show()

### 4.1 Heatmap: Country vs Year

In [8]:
# Get the top-10 countries list
top10_list = top10['country'].tolist()

# Build a pivot table that sums counts for each country-year
heatmap_data = (
    df_summary[df_summary['country'].isin(top10_list)]
    .pivot_table(
        index='country',
        columns='year',
        values='count',
        aggfunc='sum',     # sum the Permanent and Temporary Resident rows for each country-year
        fill_value=0
    )
)

# Reorder rows to match the original top-10 ranking
heatmap_data = heatmap_data.reindex(index=top10_list)

# Plot the pivot table as a heatmap (px is already imported in the previous cell)
fig2 = px.imshow(
    heatmap_data,
    text_auto=True,
    aspect="auto",
    labels=dict(x="Year", y="Country", color="Refusals"),
    title="Heatmap of Refusals for Top 10 Countries by Year",
    color_continuous_scale=["white", "red"],  # white at min, red at max
    zmin=0,                                   # force the lower bound to 0
    zmax=heatmap_data.values.max()            # set the upper bound to the largest cell value
)
fig2.update_coloraxes(colorbar_title="Refusals")
fig2.show()


### 4.2 Trend by Resident Status

In [9]:
# Line chart of refusals by resident status over years
year_resident = df_summary.groupby(['year','resident'])['count'].sum().reset_index()
fig3 = px.line(year_resident, x='year', y='count', color='resident', markers=True,
               labels={'year':'Year','count':'Refusals','resident':'Resident Status'},
               title='Refusals Trend by Resident Status (2019-2024)')
fig3.show()