# An analytic perspective on Income, Race and Drug use.
Kiefer Plender, Remolo van de Plassen, Ouail Moukthari, Huub Al

<font color="red" size=2px>Disclaimer</font><font size=2px>: The plotly plots in our jupyterbook do not have annotations within the graph. This is done on purpose because the responsive nature of the plotly library made the captions of our figures inconsistent for different screen sizes. Instead we used markdown for the figure captions.</font>

## 1.1 Introduction


Drug abuse is a complex issue that affects large parts of modern society. Stepping away from bias and stereotypes, our data story presents two clear yet distinct perspectives on drug abuse, and with them a broad view of the topic.

Our first perspective investigates whether individuals who belong to a racial minority group are more likely to abuse illicit drugs. It follows the narrative that these people may face more challenges in day-to-day life, such as financial problems or fewer job opportunities (Darity Jr., Hamilton, & Dietrich, 2018). Given the nature of drugs (specifically downers), we think these people might pick up drug habits to cope with these problems earlier than more well-off individuals and/or people of other races. The second perspective takes a broader view of the topic. It states that drug use is a universal problem and that factors like race or income do not play a direct role. Individuals with lower incomes may be more vulnerable to drug abuse, but low income is not the only factor that contributes to this statistic. Our data study relies on the notion that we can attribute the issue to more general factors, like peer pressure or general sensitivity to addiction.


By reviewing these two perspectives, we aim to present a more nuanced view of drug abuse and its victims. Challenging the current stereotypes and stigmas associated with drug abuse can help create a society that is better educated and supports the people affected by this issue ([Livingston, Milne, Fang, & Amari, 2012](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1360-0443.2011.03601.x)).




## 1.2 Dataset and preprocessing

To provide a clear overview, we decided to use a large dataset from the 2015 National Survey on Drug Use and Health. The survey captures a representative view of the adult population of the USA. Because of its completeness and large number of variables, the data story is based solely on this dataset, together with the academic papers needed to support our findings.

Fortunately, the dataset was clean and did not require much preprocessing to be usable. However, because it is survey data, the responses were stored as numeric codes that needed to be translated to their corresponding real-world values. We used the legend (codebook) of the dataset to provide a more intuitive interpretation. For example, we converted variables like sex, which has a value of 1 or 2, to the corresponding nominal values 'Male' or 'Female'. Other than this translation step, little preprocessing was needed to create the figures.
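
The translation step described above can be sketched as follows. This is a minimal illustration on toy data, not the exact preprocessing we ran: the dictionary `SEX_LABELS` and the column name `sex_str` are hypothetical names chosen for this example, and only the 1 = Male, 2 = Female coding is taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the survey data, using the same 1/2 coding for sex.
toy = pd.DataFrame({"sex": [1, 2, 2, 1]})

# Hypothetical codebook excerpt: numeric survey code -> readable label.
SEX_LABELS = {1: "Male", 2: "Female"}

# Series.map returns NaN for any code that is missing from the dictionary,
# which makes unexpected codes easy to spot afterwards.
toy["sex_str"] = toy["sex"].map(SEX_LABELS)

print(toy)
```

Keeping the original numeric column next to the labelled one means later cells can still filter on the raw codes.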

## 1.3 Visualisations

### Import of packages and reading our dataset

In [93]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np

df = pd.read_csv('nsduh_workforce_adults.csv')

### First visualisation (Bar Plot: Drug usage by race and sex):

This bar chart shows the average drug usage rate grouped by race and sex. The y-axis denotes drug usage in % and the x-axis the different race groups. Each race group is further broken down by sex, which in this dataset is either Male or Female. We split by sex because gender can itself contribute to minority status or prejudice. It is clear that some races generally have higher drug usage, but this is not the main takeaway of this plot. The main interest is the proportion of male to female drug users. For the Asian and Mixed groups there is little difference between the sexes, but for the Black/African American group there is a large difference. These findings are in line with what is known about differences in substance abuse between genders (Lambert, Brown, Phillips, & Ialongo, 2004). It is not uncommon for African American male adolescents to fall victim to peer pressure more often than females. Another reason is the role family dynamics play in preventing drug abuse among females, such as helping to raise children and setting an example for them. There are, of course, many more factors that might play a role in this difference. Notably, males have higher drug use rates for all races. The explanation for this is complicated, but it could also be due to differences in stereotypical gender roles, as with the larger gap between the sexes for African Americans mentioned above.

In [94]:
df = pd.read_csv('nsduh_workforce_adults.csv')

df_grouped = df.groupby(['race_str', 'sex'])['anydrugever'].mean().reset_index()
df_grouped.sort_values('race_str', inplace=True)

races = df_grouped['race_str'].unique()

male_df = df_grouped[df_grouped['sex'] == 1]
female_df = df_grouped[df_grouped['sex'] == 2]

trace1 = go.Bar(x=races, y=male_df['anydrugever'].values * 100, name='Male')
trace2 = go.Bar(x=races, y=female_df['anydrugever'].values * 100, name='Female')

layout = go.Layout(
    title='Drug Usage by Race and Sex',
    xaxis=dict(title='Race'),
    yaxis=dict(title='Drug Usage (%)', dtick=10), 
    barmode='group'
)

fig = go.Figure(data=[trace1, trace2], layout=layout)

fig.show()


<font size=2px>
<center> <strong>Figure 1.</strong>
<center> The races of the respondents are on the x-axis and the percentage of those that have used drugs in their lifetime on the y-axis.
<center> Helped by the GPT-4 prompt: Help me to create a bar plot to show the Drug Usage by Race and Sex (in %) for race and drug use with Plotly, use arbirtary column names. 17-6-23
</font>
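
The rates behind Figure 1 are group means of a 0/1 indicator. As a hedged sketch on toy data (the race labels "A" and "B" and all values are made up for illustration), the same table can be built with a pivot, which keeps one row per race so that the Male and Female values always line up with the same x-axis category, even if a race has no respondents of one sex:

```python
import pandas as pd

# Toy data with the same column names as the survey file; values are invented.
toy = pd.DataFrame({
    "race_str": ["A", "A", "A", "B", "B", "B"],
    "sex": [1, 1, 2, 1, 2, 2],
    "anydrugever": [1, 0, 1, 0, 0, 1],
})

# The mean of a 0/1 indicator is the share of respondents who answered 1;
# multiplying by 100 turns it into a percentage.
rates = (
    toy.pivot_table(index="race_str", columns="sex",
                    values="anydrugever", aggfunc="mean")
    * 100
)

# Replace the numeric sex codes in the column header with readable labels.
rates = rates.rename(columns={1: "Male", 2: "Female"})

print(rates)
```

Each column of `rates` can then be passed as the `y` values of one bar trace, with `rates.index` as the shared `x` values.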

### Second visualisation (Heat map: Percentage of Drug Use (Ever) by Race):

This plot shows the percentage of people of each race who have ever used a certain type of drug. The y-axis shows the races and the x-axis the types of drugs. The plot shows that marijuana is by far the drug that most people have tried, and that crack and heroin are the drugs that the fewest people have used. Native Americans have the highest usage of all races for several types of drugs: cocaine, crack, hallucinogens, inhalants, meth, and tranquilizers. According to a medically reviewed article by American Addiction Centers, this is a well-known problem among Native Americans. It could potentially be explained by historical trauma, violence (including high levels of gang violence, domestic violence, and sexual assault), poverty, high levels of unemployment, discrimination, racism, lack of health insurance, or low levels of attained education (Kaliszewski, 2022). Another finding is that Asian respondents have tried drugs far less often than other races.

In [95]:
df = pd.read_csv('nsduh_workforce_adults.csv')  # same lowercase file name as in the other cells

variables = ['marij_ever', 'cocaine_ever', 'crack_ever', 'heroin_ever', 'hallucinogen_ever',
             'inhalant_ever', 'meth_ever', 'painrelieve_ever', 'tranq_ever', 'stimulant_ever']

full_names = {
    'marij_ever': 'Marijuana',
    'cocaine_ever': 'Cocaine',
    'crack_ever': 'Crack',
    'heroin_ever': 'Heroin',
    'hallucinogen_ever': 'Hallucinogen',
    'inhalant_ever': 'Inhalant',
    'meth_ever': 'Methamphetamine',
    'painrelieve_ever': 'Pain Reliever',
    'tranq_ever': 'Tranquilizer',
    'stimulant_ever': 'Stimulant'
}

total_counts = df['race_str'].value_counts()

counts = df.groupby('race_str')[variables].sum()

counts = counts.rename(columns=full_names)

proportions = counts.div(total_counts, axis=0) * 100
proportions = proportions.round(2)

fig = px.imshow(proportions, labels=dict(x="Type of drug", y="Race", color="Percentage"),
                title="Percentage of Drug Use (Ever) by Race", color_continuous_scale='YlOrRd',
                zmin=0, zmax=100)

annotations = []
for i in range(len(proportions)):
    for j in range(len(proportions.columns)):
        annotations.append(dict(
            x=j,
            y=i,
            text=str(proportions.iloc[i, j]) + '%',
            showarrow=False,
            font=dict(color='black', size=8)  
        ))

fig.update_layout(annotations=annotations)
fig.update_xaxes(side="top")

fig.show()


<font size=2px>
<center> <strong>Figure 2.</strong>
<center> The races of the respondents are on the y-axis and the different types of drugs on the x-axis. Each box represents the percentage of respondents of that race who have used that drug in their lifetime.
<center> Helped by the GPT-4 prompt: Help me to create a heatmap plot to show the proportions for race and drug use with Plotly. 18-6-23

</font>
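
The percentages in Figure 2 come from dividing the number of users per race by the number of respondents per race. As a small sketch on invented toy data (the race labels and values are assumptions for illustration only), the snippet below shows that for 0/1 columns without missing values this is the same as taking the group mean:

```python
import pandas as pd

# Toy data with two of the survey's drug columns; values are invented.
toy = pd.DataFrame({
    "race_str": ["A", "A", "B", "B", "B", "B"],
    "marij_ever": [1, 0, 1, 1, 1, 0],
    "heroin_ever": [0, 0, 0, 1, 0, 0],
})
drug_cols = ["marij_ever", "heroin_ever"]

# Route used in the cell above: users per race divided by respondents per race.
users = toy.groupby("race_str")[drug_cols].sum()
respondents = toy["race_str"].value_counts()
by_count = users.div(respondents, axis=0) * 100

# Shortcut: the mean of a 0/1 column is already the share of users.
by_mean = toy.groupby("race_str")[drug_cols].mean() * 100

print(by_count.sort_index())
print(by_mean.sort_index())
```

The two routes can diverge when a drug column contains missing values, because `sum` skips them while `value_counts` on the race column still counts those rows.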

### Third visualisation (Correlation Plot: Income, Education, and Drugs):

Beforehand, we expected that people with lower incomes would be more likely to have used different types of drugs because of their economic and social circumstances. However, the correlation plot based on our data shows something else. At first, we only looked at the correlation between 'countofdrugs_ever' and 'PersonalIncome', 'FamilyIncome', and 'education', but we soon found that there was no correlation. We therefore added 'countofdrugs_month' and 'countofdrugs_year' to check whether our initial expectations would hold for more recent drug use. As the correlation plot shows, there is no clear correlation between the variety of drug use and income or education.

In [99]:
df = pd.read_csv('nsduh_workforce_adults.csv')

columns = ['PersonalIncome', 'FamilyIncome', 'education', 'countofdrugs_ever', 'countofdrugs_month', 'countofdrugs_year']
selected_data = df[columns]

correlation_matrix = selected_data.corr()

fig = px.imshow(correlation_matrix.loc[['countofdrugs_ever', 'countofdrugs_month', 'countofdrugs_year'], :],
                labels=dict(color="Correlation"), color_continuous_scale='YlOrRd')


fig.update_layout(
    title='Correlation Plot: Income, Education, and Drugs',
)

fig.show()

<font size=2px>
<center> <strong>Figure 3.</strong>
<center> Socio-economic factors along with the variety of drug usage in different time periods on the x-axis and just the drug usage on the y-axis. Each box corresponds to the correlation between the two variables in the dataset.
<center> Helped by the ChatGPT prompt (translated from Dutch): Make an arbitrary correlation plot based on 4 different data that I have to enter myself. 20-6-23

</font>
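
`DataFrame.corr()` computes the Pearson correlation by default, which measures linear association. Because income and education are stored as ordered bins, a rank-based coefficient is a possible cross-check. The sketch below is not part of our analysis; it only illustrates the `method` parameter on invented toy values, using two of the dataset's column names.

```python
import pandas as pd

# Invented toy values: an ordinal income bin and a drug count with one outlier.
toy = pd.DataFrame({
    "PersonalIncome": [1, 2, 3, 4, 5, 6, 7],
    "countofdrugs_ever": [0, 1, 1, 2, 2, 3, 10],
})

# Default: Pearson correlation (linear association, sensitive to outliers).
pearson = toy.corr()

# Spearman correlation works on ranks, so it only measures monotonic association.
spearman = toy.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

If both coefficients stay close to zero on the real data, that would support the reading that there is no clear relationship, linear or monotonic.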

### Fourth visualisation (Parallel categories plot: Income, Education, and Drugs):

We hoped that a parallel categories plot would give a clearer visualization of the connection between income and drug use. The idea was to visualize the most common combinations of socio-economic factors, such as education and income, that lead to higher drug use. The first plot shows the combinations of factors for all respondents. This graph is not very informative, since most people do not use many drugs at all. The bins in the graph are the following: Low = 0-3 different drugs ever used, Medium = 4-6 drugs and High = 7+. The second graph shows only the Medium and High groups, which gives the visualization we were after. But just as in the previous section, the combinations of variables leading to a high variety of drug use seem to be very random and not related to income or education at all. An interesting result is that a slightly larger proportion of above-average drug users seems to come from rich families. This might be an inaccurate representation of the real world caused by the filtering of the data, but there could also be an explanation for it, for example the unforeseen complications of nepotism or the neglect of some children from rich families. These are risky speculations that cannot be concluded from this data, but they might be an interesting topic for follow-up research.

In [97]:
df = pd.read_csv('nsduh_workforce_adults.csv')

# Column names
columns = ['race_str', 'PersonalIncome', 'education', 'countofdrugs_ever', 'FamilyIncome']

# Create DataFrame
df = pd.DataFrame(df, columns=columns)

# Using pd.cut: three equal-width bins (not quantile-based qcut, despite the column name)
df['amount_drugs_qcut'], qcut_bins = pd.cut(df['countofdrugs_ever'], bins=3, labels=['Low', 'Medium','High'], retbins=True)
print("Bins for cut:", qcut_bins)

# filter rows with only high and medium drug use.
df_filtered = df[df['amount_drugs_qcut'].isin(['Medium', 'High'])]
# Create Parallel Categories plot
parcatsall = go.Figure(data=[go.Parcats(dimensions=[
    {'label': 'Personal Income', 'values': df['PersonalIncome'], 'categoryorder': 'category ascending'},
    {'label': 'Education', 'values': df['education'], 'categoryorder': 'category ascending'},
    {'label': 'Family Income', 'values': df['FamilyIncome'], 'categoryorder': 'category ascending'},
    {'label': 'Drug Use', 'values': df['amount_drugs_qcut']},
],
    line={'color': df['amount_drugs_qcut'].map({'Low': 'lightblue','Medium': 'lightgreen', 'High': 'orangered'})},
    labelfont={'size': 12},
    tickfont={'size': 12},
    arrangement='freeform'
)],
    layout={'title': 'Analysis of Income, Education, and Drug Use'})

parcatsall.show()

# Create Parallel Categories plot
parcats = go.Figure(data=[go.Parcats(dimensions=[
    {'label': 'Personal Income', 'values': df_filtered['PersonalIncome'], 'categoryorder': 'category ascending'},
    {'label': 'Education', 'values': df_filtered['education'], 'categoryorder': 'category ascending'},
    {'label': 'Family Income', 'values': df_filtered['FamilyIncome'], 'categoryorder': 'category ascending'},
    {'label': 'Drug Use', 'values': df_filtered['amount_drugs_qcut']},
],
    line={'color': df_filtered['amount_drugs_qcut'].map({'Medium': 'lightgreen', 'High': 'orangered'})},
    labelfont={'size': 12},
    tickfont={'size': 12},
    arrangement='freeform'
)],
    layout={'title': 'Analysis of Income, Education, and Drug Use'})

# Show plot
parcats.show()

Bins for cut: [-0.01        3.33333333  6.66666667 10.        ]



<font size=2px>
<center> <strong>Figure 4.</strong>
<center> Each bar represents a bin within the variable it belongs to. The size of each bin equals the proportion of data that belongs to it within that variable. We can represent combinations of variables this way to visualize the most common socio-economic factors leading to high drug usage. The top graph represents the entire population and the bottom graph represents only the above-average drug users. The division of the Personal Income bins is the following (salary per year): 1 = Less than $10,000 (Including Loss), 2 = $10,000 - $19,999, 3 = $20,000 - $29,999, 4 = $30,000 - $39,999, 5 = $40,000 - $49,999, 6 = $50,000 - $74,999, 7 = $75,000 or more. The division of the education bins is the following: 1 = Less than high school, 2 = High school graduate, 3 = Some college degree/associate, 4 = College graduate, 5 = 12 to 17 year olds. The Family Income bins are the same as the Personal Income bins. The drug bins are distributed as follows: Low = 0-3 different drugs ever used, Medium = 4-6 drugs and High = 7+.
<center> Helped by the ChatGPT prompt: Create a sample of a parallel categories graph with 4 variables using Plotly. 19-6-23

</font>
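
The Low, Medium and High groups follow from how `pd.cut` splits a numeric range into equal-width bins. The sketch below reproduces the binning on every possible count, assuming (as the printed bin edges suggest) that `countofdrugs_ever` ranges from 0 to 10. The variable names are chosen for this example only.

```python
import pandas as pd

# Every possible value of the drug count, assuming a range of 0 to 10.
counts = pd.Series(range(11))

# bins=3 asks for three equal-width intervals over the observed range;
# retbins=True also returns the computed bin edges.
binned, edges = pd.cut(counts, bins=3,
                       labels=["Low", "Medium", "High"], retbins=True)

# Smallest and largest count that ends up in each labelled bin.
summary = counts.groupby(binned, observed=True).agg(["min", "max"])

print(edges)
print(summary)
```

Unlike `pd.qcut`, which would put roughly the same number of respondents in each group, `pd.cut` keeps the interval widths equal, so most respondents fall in the Low bin.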

### Draft Graph (Sunburst Graph: Crack Use and Education Level):

This visualization shows the relation between crack use and education level. Crack is a substance that is known to be highly addictive. Therefore, this plot could be interesting to look at the relation between addiction and education.

In [98]:
data = pd.read_csv('nsduh_workforce_adults.csv')  # The first row of the file is read as the column names

data['crack_ever_bool'] = data['crack_ever'].map({1: 'Used', 0: 'Did Not Use'})  # Assuming 1 represents True and 0 represents False

fig = px.sunburst(data, path=['education', 'crack_ever_bool'], title='Crack Use Across Education Levels')

fig.show()


<font size=2px>
<center> <strong>Figure 5.</strong>
<center> Sunburst graph for the visualization of drug use across education levels. The numbers in the inner circle represent education levels via the following: 1 = Less than high school, 2 = High school graduate, 3 = Some college degree/associate, 4 = College graduate, 5 = 12 to 17 year olds.
<center> Helped by the GPT-4 prompt: Help me to create a sunburst plot to show the relation between crack use and education based on my data, 18-6-23
</font>

## Marijuana use in the last 30 days and family income

In [101]:
# Read the CSV file
data = pd.read_csv('NHANES1718.csv')

# Calculate the frequency of combinations
combination_freq = data.groupby(['DUQ230', 'IND235']).size().reset_index(name='Frequency')

# Create a bubble plot using Plotly
fig = px.scatter(combination_freq, x='DUQ230', y='IND235', size='Frequency')

# Set the axis labels
fig.update_xaxes(title='DUQ230')
fig.update_yaxes(title='IND235')

# Show the plot
fig.show()

## Cocaine use this month and Income

In [106]:
variable = 'DUQ420'
# Read the CSV file
data = pd.read_csv('NHANES1718.csv')

# Calculate the frequency of combinations
combination_freq = data.groupby([variable, 'IND235']).size().reset_index(name='Frequency')

# Create a bubble plot using Plotly
fig = px.scatter(combination_freq, x=variable, y='IND235', size='Frequency')

# Set the axis labels
fig.update_xaxes(title=variable)
fig.update_yaxes(title='IND235')

# Show the plot
fig.show()

## Appendix



<ul>
    <li> Livingston, J. D., Milne, T., Fang, M. L., & Amari, E. (2011). <em>The effectiveness of interventions for reducing stigma related to substance use disorders: a systematic review </em>. Addiction, 106(10), 1786-1796. doi:10.1111/j.1360-0443.2011.03601.x
    <li> Darity Jr., W. A., Hamilton, D., & Dietrich, J. (2018). <em>The Persistent Effect of Race and the Legacy of Slavery on Income Inequality in the United States</em>. Review of Black Political Economy, 45(1), 29-60. doi:10.1007/s12114-017-9250-9
    <li> Lambert, S. F., Brown, T. L., Phillips, C. M., & Ialongo, N. S. (2004). <em>Gender and Ethnic Differences in the Predictors of Drug Use among African American Adolescents</em>. Journal of Youth and Adolescence, 33(5), 373-387. doi:10.1023/B:JOYO.0000032675.06729.f0
    <li> Kaliszewski, M. (2022, September 12). <em>Alcohol and Drug Abuse Among Native Americans</em>. Retrieved from <a href='https://americanaddictioncenters.org/rehab-guide/addiction-statistics/native-americans#'> American Addiction Centers </a>
    <li> OpenAI. (2021). <em>ChatGPT: Language Model</em>. <a href='https://openai.com/'> Open AI </a>
    
</ul>