# Analysing Students' Mental Health In Python
In this notebook, we perform exploratory data analysis on a dataset about the mental health of domestic and international students. We look at how social connectedness and cultural issues relate to mental health, and we visualise the results using the Python Plotly package.

## The Data

This survey was conducted in 2018 at an international university in Japan, and the associated study was published in 2019. It was approved by several ethical and regulatory boards.

The study found that international students have a higher risk of mental health difficulties compared to the general population, and that social connectedness and acculturative stress are predictive of depression.

Social connectedness: a measure of belonging to a social group or network.

Acculturative stress: stress associated with learning about and integrating into a new culture.

[See paper for more info, including data description.](https://www.mdpi.com/2306-5729/4/3/124/htm)

### Load The Libraries

In [1]:
# Import libraries
import pandas as pd 

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

### Clean The Data

Our data is in one table that includes all of the survey data. There are 50 fields and, according to the paper, 268 records. Each row is a student.

In [2]:
students = pd.read_csv('data.csv')

In [3]:
# Check the number of rows in the dataset
students.shape[0]

286

In [4]:
# Inspect the first few rows of the dataset
students.head()

Unnamed: 0,inter_dom,Region,Gender,Academic,Age,Age_cate,Stay,Stay_Cate,Japanese,Japanese_cate,...,Friends_bi,Parents_bi,Relative_bi,Professional_bi,Phone_bi,Doctor_bi,religion_bi,Alone_bi,Others_bi,Internet_bi
0,Inter,SEA,Male,Grad,24.0,4.0,5.0,Long,3.0,Average,...,Yes,Yes,No,No,No,No,No,No,No,No
1,Inter,SEA,Male,Grad,28.0,5.0,1.0,Short,4.0,High,...,Yes,Yes,No,No,No,No,No,No,No,No
2,Inter,SEA,Male,Grad,25.0,4.0,6.0,Long,4.0,High,...,No,No,No,No,No,No,No,No,No,No
3,Inter,EA,Female,Grad,29.0,5.0,1.0,Short,2.0,Low,...,Yes,Yes,Yes,Yes,No,No,No,No,No,No
4,Inter,EA,Female,Grad,28.0,5.0,1.0,Short,1.0,Low,...,Yes,Yes,No,Yes,No,Yes,Yes,No,No,No


In [5]:
# Count the number of students in each inter_dom group, including null values
students['inter_dom'].value_counts(dropna=False)

inter_dom
Inter    201
Dom       67
NaN       18
Name: count, dtype: int64

In [6]:
# Filter rows where inter_dom is null
students[students['inter_dom'].isnull()]

Unnamed: 0,inter_dom,Region,Gender,Academic,Age,Age_cate,Stay,Stay_Cate,Japanese,Japanese_cate,...,Friends_bi,Parents_bi,Relative_bi,Professional_bi,Phone_bi,Doctor_bi,religion_bi,Alone_bi,Others_bi,Internet_bi
268,,,,,,,,,,,...,,,,,,,,,,
269,,,,,,,,,,,...,128.0,137.0,66.0,61.0,30.0,46.0,19.0,65.0,21.0,45.0
270,,,,,,,,,,,...,140.0,131.0,202.0,207.0,238.0,222.0,249.0,203.0,247.0,223.0
271,,,,,,,,,,,...,,,,,,,,,,
272,,,,,,,,,,,...,128.0,137.0,66.0,61.0,30.0,46.0,19.0,65.0,21.0,45.0
273,,,,,,,,,,,...,140.0,131.0,202.0,207.0,238.0,222.0,249.0,203.0,247.0,223.0
274,,,,,,,,,,,...,,,,,,,,,,
275,,,,,,,,,,,...,123.0,,,,,,,,,
276,,,,,,,,,,,...,140.0,,,,,,,,,
277,,,,,,,,,,,...,131.0,,,,,,,,,


In [7]:
# Drop rows where inter_dom is null
students = students.dropna(subset=['inter_dom'])
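As a sanity check, the number of rows left after this step can be compared with the number of rows that had a missing `inter_dom` value. The sketch below uses a small hypothetical frame (`toy`, with made-up values) rather than `data.csv`. On the real data the same arithmetic would be 286 loaded rows minus 18 null rows, which gives the 268 records reported in the paper.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the survey table: three student rows followed by
# two trailing rows that carry no student record (inter_dom is missing).
toy = pd.DataFrame({
    'inter_dom': ['Inter', 'Dom', 'Inter', np.nan, np.nan],
    'Friends_bi': ['Yes', 'No', 'Yes', np.nan, '128.0'],
})

# Count the rows that will be dropped, then drop them
n_missing = int(toy['inter_dom'].isna().sum())
cleaned = toy.dropna(subset=['inter_dom'])

# Every dropped row should be explained by a missing inter_dom value
assert len(cleaned) == len(toy) - n_missing
print(f'{len(toy)} rows loaded, {n_missing} dropped, {len(cleaned)} kept')
```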

### Inspect The Data

In [8]:
# Print the proportion of students in each category
for col in ['inter_dom', 'Region', 'Gender', 'Academic']:
    print(students[col].value_counts(normalize=True))

inter_dom
Inter    0.75
Dom      0.25
Name: proportion, dtype: float64
Region
SEA       0.455224
JAP       0.257463
EA        0.179104
SA        0.067164
Others    0.041045
Name: proportion, dtype: float64
Gender
Female    0.634328
Male      0.365672
Name: proportion, dtype: float64
Academic
Under    0.921642
Grad     0.078358
Name: proportion, dtype: float64


Most of the participants were international students (75%). Students from Asia accounted for 96% of the sample: the largest group came from South East Asia (45.52%), followed by Japan (25.75%). In terms of gender, female participants accounted for 63.43% and male participants for 36.57%.
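The share of students from Asia can be reproduced by summing the regional proportions. The sketch below copies the proportions from the output above into a Series. It assumes that `SEA`, `JAP`, `EA` and `SA` are the Asian regions and that `Others` covers everything else; that grouping is an interpretation of the region codes, not something stated in the output.

```python
import pandas as pd

# Region proportions copied from the value_counts(normalize=True) output
region_share = pd.Series({
    'SEA': 0.455224,
    'JAP': 0.257463,
    'EA': 0.179104,
    'SA': 0.067164,
    'Others': 0.041045,
})

# Assumption: these four codes are the Asian regions
asian_regions = ['SEA', 'JAP', 'EA', 'SA']
asia_share = region_share.loc[asian_regions].sum()

print(f'Asia: {asia_share:.1%}')  # prints "Asia: 95.9%"
```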

In [9]:
# Describe the Age variable
students['Age'].describe()

count    268.000000
mean      20.873134
std        2.765279
min       17.000000
25%       19.000000
50%       20.000000
75%       22.000000
max       31.000000
Name: Age, dtype: float64

In [10]:
# Build a histogram of the Age column
fig = px.histogram(
    data_frame=students, 
    x='Age', 
    color='Academic',
    title='Age distribution of the students'
)
fig.show()

Participants were between 17 and 31 years old, with a mean age of approximately 21 years. The age distribution is right-skewed: most participants sit at the younger end of the range, with a long tail of older students. The majority of students fell within the 18-21 age bracket. This pattern is consistent with the composition of the sample: only a small share (7.84%) of the participants were graduate students, who tend to be older than the undergraduate students (92.16%) who made up most of the sample.
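The direction of the skew can also be checked numerically. A positive sample skewness, or a mean above the median (20.87 versus 20 in the `describe()` output), points to a longer tail on the older side. The sketch below uses a hypothetical list of ages, not the survey data; on the notebook's frame the equivalent call would be `students['Age'].skew()`.

```python
import pandas as pd

# Hypothetical ages: many young undergraduates plus a few older graduate students
ages = pd.Series([18, 18, 19, 19, 19, 20, 20, 21, 22, 27, 30])

# Sample skewness: a value above 0 indicates a longer right (older) tail
skewness = ages.skew()
mean_age, median_age = ages.mean(), ages.median()

print(f'skewness={skewness:.2f}, mean={mean_age:.2f}, median={median_age:.1f}')
```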

In [11]:
# Build a histogram on the Stay column
fig = px.histogram(
    data_frame=students, 
    x='Stay', 
    title='Stay distribution of the students',
    labels={'Stay': 'Stay duration'}
)
fig.show()

At the time of the survey, most participants had been at the university for 1 to 3 years.

In [12]:
def create_bar_chart(data, category_column, title, x_label, y_label='Percentage'):
    # Share of each category, scaled to 0-100 so the values match the 'Percentage' axis label
    percentages = (data[category_column].value_counts(normalize=True) * 100).reset_index()
    percentages.columns = [category_column, 'percentage']
    
    fig = px.bar(
        data_frame=percentages,
        x=category_column,
        y='percentage',
        title=title,
        labels={category_column: x_label, 'percentage': y_label}
    )
    return fig

# Create individual charts
english_fig = create_bar_chart(students, 'English_cate', 'English level of the students', 'English level')
japanese_fig = create_bar_chart(students, 'Japanese_cate', 'Japanese level of the students', 'Japanese level')

# Create a subplot figure
fig = make_subplots(
    rows=1, 
    cols=2, 
    subplot_titles=(
        'English level of the students', 
        'Japanese level of the students'
    ),
    shared_yaxes=True
)

# Add the bar charts to the subplot figure
for trace in english_fig['data']:
    fig.add_trace(trace, row=1, col=1)

for trace in japanese_fig['data']:
    fig.add_trace(trace, row=1, col=2)

# Update layout
fig.update_layout(
    title_text='Comparison of English and Japanese Levels of Students',
    showlegend=False,
    yaxis_title='Percentage'
)

fig.show()

Regarding language proficiency, a majority (61.94%) rated their English ability as high (4 or 5 on a 1-5 scale). Self-rated Japanese ability was more varied, with ratings spread fairly evenly across low to high proficiency levels.
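A share such as "rated 4 or 5" can be computed as the mean of a boolean mask. The sketch below uses hypothetical ratings. It assumes the numeric `English` column holds the 1-5 self-rating, in which case the notebook equivalent would be `(students['English'] >= 4).mean()`.

```python
import pandas as pd

# Hypothetical self-ratings on the 1-5 scale (not taken from the survey)
english = pd.Series([5, 4, 3, 4, 2, 5, 3, 4, 1, 5])

# The mean of a boolean mask is the proportion of True values
high_share = (english >= 4).mean()

print(f'Rated 4 or 5: {high_share:.0%}')  # prints "Rated 4 or 5: 60%"
```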

#### Demographic Conclusion
1. Participants are young adult learners, most of them of traditional undergraduate age.
2. The sample is internationally diverse, with most students coming from South East Asia.
3. There is a noticeable gender imbalance, with female students outnumbering male students.
4. Based on self-evaluation, most participants rated themselves as highly proficient in English, while Japanese proficiency varied across the group.

### Mental Health Conditions

In [13]:
# Show all the fields in the dataset
students.columns

Index(['inter_dom', 'Region', 'Gender', 'Academic', 'Age', 'Age_cate', 'Stay',
       'Stay_Cate', 'Japanese', 'Japanese_cate', 'English', 'English_cate',
       'Intimate', 'Religion', 'Suicide', 'Dep', 'DepType', 'ToDep', 'DepSev',
       'ToSC', 'APD', 'AHome', 'APH', 'Afear', 'ACS', 'AGuilt', 'AMiscell',
       'ToAS', 'Partner', 'Friends', 'Parents', 'Relative', 'Profess',
       ' Phone', 'Doctor', 'Reli', 'Alone', 'Others', 'Internet', 'Partner_bi',
       'Friends_bi', 'Parents_bi', 'Relative_bi', 'Professional_bi',
       'Phone_bi', 'Doctor_bi', 'religion_bi', 'Alone_bi', 'Others_bi',
       'Internet_bi'],
      dtype='object')

In [14]:
# Build a box plot of the ToAS score by DepType, colored by inter_dom
fig = px.box(
    data_frame=students, 
    x='ToAS', 
    y='DepType', 
    color='inter_dom', 
    title='Box plot of total scores of acculturative stress by depression type',
    labels={
        'ToAS': 'Total Acculturative Stress Score', 
        'DepType': 'Depression Type', 
        'inter_dom': 'International/Domestic'
    }
)
fig.show()

Students who reported major depressive disorder tended to have higher acculturative stress scores than students without a depressive disorder or with other types of depressive disorder. Across all depression types, acculturative stress scores were higher for international students than for domestic students.
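The comparison read off the box plot can be backed by a table of group medians. The sketch below uses a hypothetical frame with the same column names as the survey; the `DepType` labels and the `ToAS` values are made up for illustration.

```python
import pandas as pd

# Hypothetical records; labels and scores are illustrative only
toy = pd.DataFrame({
    'DepType': ['No', 'No', 'Major', 'Major', 'No', 'Major', 'Other', 'Other'],
    'inter_dom': ['Inter', 'Dom', 'Inter', 'Dom', 'Inter', 'Inter', 'Inter', 'Dom'],
    'ToAS': [70, 55, 98, 80, 66, 102, 75, 60],
})

# Median acculturative stress per depression type, one column per student group
median_toas = (
    toy.groupby(['DepType', 'inter_dom'])['ToAS']
    .median()
    .unstack('inter_dom')
)

print(median_toas)
```

Medians are used here because they correspond to the centre line of each box in the plot.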

In [15]:
# List the columns that are continuous variables
continuous_variables = ['Age', 'Stay', 'Japanese', 'English', 'ToDep', 'ToSC', 'APD', 'AHome', 'APH', 'Afear', 'ACS', 'AGuilt', 'AMiscell', 'ToAS', 'Partner', 'Friends', 'Parents', 'Relative', 'Profess', ' Phone', 'Doctor', 'Reli', 'Alone', 'Others', 'Internet']

# Create a subset dataframe with only the continuous variables
data_cont = students[continuous_variables]

# Compute the Pearson correlation matrix
data_corr = data_cont.corr(method='pearson')

# Build the Heatmap
fig = go.Figure(go.Heatmap(
    x=data_corr.columns, 
    y=data_corr.columns, 
    z=data_corr.values.tolist(), 
    zmin=-1, 
    zmax=1
))

# Adjust the plot size
fig.update_layout(width=900, height=900)

# Show the plot
fig.show()
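A heatmap of this size is easier to interpret alongside a ranked list of correlations for one variable of interest. The sketch below ranks correlations with the depression score `ToDep` by absolute strength. It uses a small hypothetical frame with made-up values; on the notebook's data, `data_corr['ToDep']` could be used in place of `corr['ToDep']`.

```python
import pandas as pd

# Hypothetical scores for six students (not taken from the survey)
toy = pd.DataFrame({
    'ToDep': [2, 5, 9, 12, 15, 20],
    'ToAS': [50, 58, 70, 75, 90, 99],
    'ToSC': [45, 44, 38, 35, 30, 22],
    'Age': [19, 22, 18, 21, 20, 19],
})

corr = toy.corr(method='pearson')

# Correlations with ToDep, excluding its correlation with itself
todep_corr = corr['ToDep'].drop('ToDep')

# Order by absolute strength so strong negative correlations are not buried
order = todep_corr.abs().sort_values(ascending=False).index
ranked = todep_corr.reindex(order)

print(ranked)
```

Sorting on the absolute value keeps the sign visible in the result while still placing the strongest relationships first, whichever direction they run in.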