<a href="https://www.kaggle.com/code/mcpenguin/thailand-drug-offenses-eda?scriptVersionId=143235112" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Thailand Drug Offenses - EDA

In this notebook, we investigate the Thailand Drug Offenses dataset using a variety of data analysis techniques.

# Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

import matplotlib.pyplot as plt
import matplotlib.colors as colors

import seaborn as sns

from urllib.request import urlopen

import geojson
import geopandas as gpd
import geoplot as gplt
import geoplot.crs as gcrs

from ipywidgets import interact

# Load Dataset

In [None]:
df = pd.read_csv("/kaggle/input/thailand-drug-offenses-2017-2022/thai_drug_offenses_2017_2022.csv")
df.head()

Before working with the data, it is important to understand what each column means. The dataset brief gives the following information about the columns:

* `fiscal_year`: The fiscal year during which the drug offenses were recorded.

* `types_of_drug_offenses`: The specific type or category of drug offense being reported. The types include drug use cases, suspects in drug use cases, possession cases, suspects in possession cases, possession with intent to distribute cases, suspects in possession with intent to distribute cases, trafficking cases, suspects in trafficking cases, production cases, suspects in production cases, import cases, suspects in import cases, export cases, suspects in export cases, conspiracy cases, and suspects in conspiracy cases.

* `no_cases`: The total number of cases recorded for the specific combination of fiscal year, type of drug offense, and province.

* `province_th`: The name of the province in Thailand, written in Thai.

* `province_en`: The name of the province in Thailand, written in English.

Let's also get some basic information about the dataset, namely the number of non-null values in each column and the number of rows:

In [None]:
df.info()
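
As a complement to reading the `info()` output, missing values can also be counted directly. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration and are not taken from the dataset); the same two calls would apply to `df`.

```python
import pandas as pd

# Hypothetical stand-in for `df`; the values are invented for illustration.
toy = pd.DataFrame({
    "fiscal_year": [2017, 2017, 2018],
    "no_cases": [10, 0, 7],
    "province_en": ["Bangkok", "Phuket", "Bangkok"],
})

# Count missing values per column, then reduce to a single yes/no answer.
missing_per_column = toy.isna().sum()
has_missing = bool(missing_per_column.any())

print(missing_per_column)
print("Any missing values:", has_missing)
```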

We see that there are no missing values. We should also identify the unique values for each categorical column, including fiscal year:

In [None]:
for col in ["fiscal_year", "types_of_drug_offenses", "province_th", "province_en"]:
    print(f"For column {col}, there are {len(df[col].unique())} unique values:")
    print(list(df[col].unique()))
    print()

# Exploratory Data Analysis

For this notebook, we will use a GeoJSON file of Thailand's provinces, which is available in [this GitHub gist](https://gist.github.com/jeepkd/4e31e6a10f8297b9de50c62856927ecf?short_path=ac424e6).

In [None]:
geojson_link = "https://gist.githubusercontent.com/jeepkd/4e31e6a10f8297b9de50c62856927ecf/raw/9899d9f1ca4cd7c5f103a9b2455d9a01f0c8f895/thailand.json"
with urlopen(geojson_link) as url:
    gdf = gpd.read_file(url)

## Fiscal Year

We first investigate the number of rows recorded for each fiscal year.

In [None]:
sns.countplot(data=df, x="fiscal_year")
plt.show()

From this, we see that the number of rows is the same for each of the 6 fiscal years. We might also expect the number of rows per fiscal year to be the same for every offense type:

In [None]:
plt.figure(figsize=(10, 10))
sns.countplot(data=df, x="fiscal_year", hue="types_of_drug_offenses")
plt.show()

We see that our suspicion is confirmed. Specifically, the count for each `(year, offense type)` pair is exactly the number of provinces in Thailand, which is 77.
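
The same balance check can be done numerically rather than by reading bar heights. This is a small sketch on a hypothetical toy frame (2 years, 2 offense types, and 3 made-up provinces `A`, `B`, `C`); on the real `df`, the group size would be expected to equal the number of provinces.

```python
from itertools import product

import pandas as pd

# Hypothetical balanced toy data: every (year, offense, province) combination once.
years = [2017, 2018]
offenses = ["drug_use_cases", "possession_cases"]
provinces = ["A", "B", "C"]
toy = pd.DataFrame(
    list(product(years, offenses, provinces)),
    columns=["fiscal_year", "types_of_drug_offenses", "province_en"],
)
toy["no_cases"] = 1

# Number of rows for each (year, offense type) pair.
counts = toy.groupby(["fiscal_year", "types_of_drug_offenses"]).size()

# The data is balanced if every pair has the same number of rows.
is_balanced = counts.nunique() == 1
rows_per_group = int(counts.iloc[0])

print(counts)
print("Balanced:", is_balanced, "| rows per group:", rows_per_group)
```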

## Drug Use Cases

Let's investigate the drug use cases across the different fiscal years and provinces. We can first plot the aggregate data for each fiscal year across all the provinces:

In [None]:
df_grouped = df[df["types_of_drug_offenses"].eq("drug_use_cases")].groupby(by="fiscal_year")["no_cases"].sum()
sns.lineplot(data=df_grouped)
plt.show()

We might also want to see the corresponding trend for each province:

In [None]:
plt.figure(figsize=(15,30))
df_filtered = df[df["types_of_drug_offenses"].eq("drug_use_cases")]
sns.lineplot(data=df_filtered, x="fiscal_year", y="no_cases", hue="province_en")
plt.show()

## Preprocessing GeoJSON Data

The per-province line plot above is understandably very messy, given that we are plotting data for 77 provinces. A better visualization might be a choropleth map of the provinces in Thailand for each fiscal year.

Let's first see what our province map looks like:

In [None]:
gplt.polyplot(
    gdf, 
    projection=gcrs.AlbersEqualArea(), 
    edgecolor='black', facecolor='lightgrey', linewidth=.3,
    figsize=(12, 8))
plt.show()

Our GeoJSON file looks like this:

In [None]:
gdf.head()

From the looks of things, the province name information is stored in the column `NAME_1`.

Let's see how many rows are in our geo-dataframe:

In [None]:
gdf.shape

This matches the number of provinces in the original dataset, as expected.

In order to merge this location information with our dataset, we need the province names to match in both datasets. Thus, we should check whether there are any discrepancies with the province names:

In [None]:
diff1 = [x for x in df['province_en'].unique() if x not in gdf['NAME_1'].unique()]
diff2 = [x for x in gdf['NAME_1'].unique() if x not in df['province_en'].unique() ]

print(f"In df but not in gdf: {diff1}")
print(f"In gdf but not in df: {diff2}")

From these differences, we can see that:

* Our dataset has `Bangkok`, which is named `Bangkok Metropolis` in the geo-dataframe.
* Our dataset has `Loburi`, which is named `Lop Buri` in the geo-dataframe.
* Our dataset has `buogkan`, which is named `Bueng Kan` in the geo-dataframe.
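
The comparison above can also be written with set differences, which avoids recomputing `unique()` on every iteration of the list comprehension. The sketch below uses two short hypothetical name lists in place of `df["province_en"]` and `gdf["NAME_1"]`.

```python
# Hypothetical stand-ins for df["province_en"].unique() and gdf["NAME_1"].unique().
dataset_names = ["Bangkok", "Loburi", "Phuket"]
geo_names = ["Bangkok Metropolis", "Lop Buri", "Phuket"]

# Names that appear on one side only; sorted for a stable, readable output.
only_in_dataset = sorted(set(dataset_names) - set(geo_names))
only_in_geo = sorted(set(geo_names) - set(dataset_names))

print(f"In dataset but not in geo data: {only_in_dataset}")
print(f"In geo data but not in dataset: {only_in_geo}")
```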

We can then rename the relevant entries in the geo-dataframe to reflect the province names in the dataset.

In [None]:
gdf.loc[gdf["NAME_1"] == "Bangkok Metropolis", "NAME_1"] = "Bangkok"
gdf.loc[gdf["NAME_1"] == "Lop Buri", "NAME_1"] = "Loburi"
gdf.loc[gdf["NAME_1"] == "Bueng Kan", "NAME_1"] = "buogkan"
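
An alternative to one `.loc` assignment per province is to keep the renames in a single dictionary and apply them with `Series.replace`. This is only a sketch: `name_map` and `toy_geo` are illustrative names, and `toy_geo` is a plain pandas frame standing in for `gdf`.

```python
import pandas as pd

# Geo-dataframe spelling -> dataset spelling, as listed in the text.
name_map = {
    "Bangkok Metropolis": "Bangkok",
    "Lop Buri": "Loburi",
    "Bueng Kan": "buogkan",
}

# Hypothetical stand-in for the NAME_1 column of `gdf`.
toy_geo = pd.DataFrame(
    {"NAME_1": ["Bangkok Metropolis", "Lop Buri", "Bueng Kan", "Phuket"]}
)

# Values found in the mapping are replaced; all others are left unchanged.
toy_geo["NAME_1"] = toy_geo["NAME_1"].replace(name_map)
print(toy_geo["NAME_1"].tolist())
```

Keeping the mapping in one place makes it easier to extend if further mismatches turn up.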

We can then merge our geo-dataframe with our dataset to create one giant dataset:

In [None]:
gdf_new = gdf.merge(df.rename(columns={"province_en": "NAME_1"}), on="NAME_1")
gdf_new.head()

If the merge was successful, the number of rows in the new geo-dataframe should match the number of rows in our original dataframe:

In [None]:
gdf_new.shape[0] == df.shape[0]
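
A row-count comparison shows that something went wrong but not which provinces failed to match. One possible refinement, sketched here on hypothetical toy frames (`toy_geo`, `toy_df`, and the `geometry_id` column are made up), is an outer merge with `indicator=True`, which labels each row as `both`, `left_only`, or `right_only`. The sketch uses plain pandas frames and assumes the same arguments carry over to the geo-dataframe merge.

```python
import pandas as pd

# Hypothetical stand-ins: one row per province on the left, many rows on the right.
toy_geo = pd.DataFrame(
    {"NAME_1": ["Bangkok", "Phuket", "Lop Buri"], "geometry_id": [1, 2, 3]}
)
toy_df = pd.DataFrame(
    {"NAME_1": ["Bangkok", "Bangkok", "Phuket", "Loburi"], "no_cases": [5, 3, 2, 4]}
)

# Outer merge keeps unmatched rows from both sides; `validate` checks that
# province names are unique in the left frame.
merged = toy_geo.merge(
    toy_df, on="NAME_1", how="outer", indicator=True, validate="one_to_many"
)

# Rows that did not find a partner on the other side.
unmatched = merged[merged["_merge"] != "both"]
unmatched_names = sorted(unmatched["NAME_1"])
print(unmatched_names)
```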

## Map Visualizations of Variates

We can then plot choropleth maps displaying the various variates of the data using an interactive display. Feel free to play around with the columns and years.

In [None]:
proj = gcrs.AlbersEqualArea()

cols = df["types_of_drug_offenses"].unique()
years = range(2017, 2023)

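def on_trait_change(year, col):
    # Select the rows for one fiscal year and one offense type
    gdf_filtered = gdf_new[(gdf_new["fiscal_year"] == year) & (gdf_new["types_of_drug_offenses"] == col)]
    # LogNorm needs a strictly positive vmin and vmax > vmin, so clamp both
    # in case some (or all) provinces report 0 cases
    vmin = max(gdf_filtered["no_cases"].min(), 1)
    vmax = max(gdf_filtered["no_cases"].max(), vmin + 1)
    gplt.choropleth(
        gdf_filtered,
        hue="no_cases",
        projection=proj,
        edgecolor='black',
        cmap='Reds',
        norm=colors.LogNorm(vmin=vmin, vmax=vmax),
        legend=True,
        linewidth=.3)
    plt.show()
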
interact(on_trait_change, year=list(years), col=list(cols))
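
The choropleth uses a logarithmic color scale. A likely motivation is that case counts can differ by orders of magnitude between provinces, so a linear scale would push most provinces into the palest shades. The sketch below compares the two scalings on made-up values (`vmin`, `vmax`, and the sample counts are illustrative, not taken from the dataset). Note that `LogNorm` requires strictly positive limits.

```python
import matplotlib.colors as colors

# Illustrative limits only; the real limits depend on the selected year and offense type.
vmin, vmax = 1, 10_000

linear = colors.Normalize(vmin=vmin, vmax=vmax)
log = colors.LogNorm(vmin=vmin, vmax=vmax)

# Each norm maps a count to a position in [0, 1] along the colormap.
for value in (10, 100, 1_000):
    linear_pos = float(linear(value))
    log_pos = float(log(value))
    print(f"{value:>5}: linear={linear_pos:.4f}  log={log_pos:.4f}")
```

On the linear scale a count of 100 sits almost at the bottom of the colormap, while on the log scale it sits at the midpoint between 1 and 10,000.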