<p  style="text-align: center;"><font size="10"><b>DRUG EFFICACY ON MEMORY</b></font></p>
<p  style="text-align: center;"><font size="4">EXPLORATORY DATA ANALYSIS & CLUSTERING</font></p>

<img src='https://github.com/miltonsuggs/06-memory-test/blob/main/memory%20test.jpg?raw=true' alt='MEMORY TEST'>

# INTRODUCTION

This notebook explores an experiment on the effects of anti-anxiety medicine on memory recall when participants are primed with happy or sad memories. The experiment was conducted on novel Islanders, who mimic real-life humans in their response to external factors.

Drugs of interest (known-as) [Dosage 1, 2, 3]:

A - Alprazolam (Xanax, Long-term) [1mg/3mg/5mg]

T - Triazolam (Halcion, Short-term) [0.25mg/0.5mg/0.75mg]

S - Sugar Tablet (Placebo) [1 tab/2 tabs/3 tabs]

* Dosages follow a 1:1 ratio to ensure validity
* Happy or Sad memories were primed 10 minutes prior to testing
* Participants were tested every day for 1 week to mimic addiction


<h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Table of Contents</h3>

1. [Libraries & Packages](#libraries)
2. [Initial Insights](#insights)
3. [Data Preprocessing & Feature Engineering](#preprocessing)
4. [Data Exploration & Visualization](#exploration)  
    A. [Univariate Exploration](#univariate)  
    B. [Bi/Multivariate Exploration](#multivariate)  
      I. [Memory Score Comparisons](#memscore)  
      II. [Difference Comparisons](#diff)  
      III. [Difference Category Analysis](#diffcat)  
5. [Additional Feature Engineering](#features)  
6. [Clustering](#clustering)  
    A. [K-Means Clustering](#kmeans)  
    B. [Hierarchical Clustering](#hierarchical)  
    
<!-- 7. [Algorithm Comparison](#comparison)
8. [Conclusion](#conclusion)  -->

<a id="libraries"></a>
## LIBRARIES & PACKAGES

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import missingno as msno
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="insights"></a>

## INITIAL INSIGHTS

In [None]:
df = pd.read_csv('../input/memory-test-on-drugged-islanders-data/Islander_data.csv')
df.head()

In [None]:
df.dtypes

In [None]:
df.info()

In [None]:
# PRINT UNIQUE VALUES FOR EACH COLUMN

for column in df.columns:
    print(column)
    print(df[column].unique())
    print('')
    

## MISSING VALUES

THIS DATASET CONTAINS NO MISSING VALUES
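
A statement like this can also be checked programmatically. The following is a minimal sketch on a small made-up frame (`toy` is a hypothetical stand-in, not the Islander data); in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the Islander data; use `df` in the notebook.
toy = pd.DataFrame({
    "age": [25, 40, 61],
    "Drug": ["A", "S", "T"],
    "Diff": [5.2, -1.0, 0.0],
})

# Count missing entries per column, then across the whole frame.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero confirms that no imputation or row dropping is needed before the analysis.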

In [None]:
missing_percentage=df.isna().sum()*100/df.shape[0]
missing_percentage

<a id="preprocessing"></a>

## PREPROCESSING & FEATURE ENGINEERING

For certain continuous variables I like to bin them into categorical variables to add a different perspective to the exploration. In this instance "age" and "Diff" are two continuous variables that will benefit from being binned into separate categorical variables. 

Below I created the new variables "age_cat" and "diff_cat". **"age_cat"** will separate each patient into an age group: "young adult", "middle age", or "senior adult".  **"diff_cat"** will categorize the values in the "Diff" column as "increase", "decrease", or "no change".

I created a new column with each patient's full name to ensure that each patient is uniquely identifiable in any exploration. 
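
As an illustration of the binning idea, here is a sketch that uses `pd.cut` on a small made-up frame. The ages and differences are hypothetical, and `pd.cut` is an alternative to the `.loc` assignments used in this notebook, not the method the notebook itself applies.

```python
import pandas as pd

# Hypothetical ages and score differences, not taken from the dataset.
toy = pd.DataFrame({
    "age": [18, 35, 36, 55, 56, 83],
    "Diff": [4.1, -2.0, 0.0, 7.5, -0.3, 0.0],
})

# Right-inclusive bins: (17, 35], (35, 55], (55, inf)
toy["age_cat"] = pd.cut(
    toy["age"],
    bins=[17, 35, 55, float("inf")],
    labels=["young adult", "middle age", "senior adult"],
)

# Map the sign of Diff to a label.
toy["diff_cat"] = toy["Diff"].apply(
    lambda d: "increase" if d > 0 else "decrease" if d < 0 else "no change"
)

print(toy)
```

The bin edges are chosen so that the boundary ages 35 and 55 land in the same groups as with the explicit comparisons.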

In [None]:
# BIN AGE GROUPS and DIFF CATEGORIES

# Start with empty object columns so that string labels can be assigned
# without pandas dtype warnings
df['age_cat'] = None
df['diff_cat'] = None

# Age groups: 18-35, 36-55, and over 55
df.loc[(df['age'] >= 18) & (df['age'] <= 35), 'age_cat'] = 'young adult'
df.loc[(df['age'] > 35) & (df['age'] <= 55), 'age_cat'] = 'middle age'
df.loc[df['age'] > 55, 'age_cat'] = 'senior adult'

# Direction of the change in memory score
df.loc[df['Diff'] > 0, 'diff_cat'] = 'increase'
df.loc[df['Diff'] < 0, 'diff_cat'] = 'decrease'
df.loc[df['Diff'] == 0, 'diff_cat'] = 'no change'

    
# CREATE FULL NAME COLUMN

df['full_name']= df['first_name'] + ' ' + df['last_name']

    

In [None]:
# DROP FIRST_NAME & LAST_NAME COLUMNS
# drop() returns a new DataFrame, so the result has to be assigned back
df = df.drop(columns=['first_name', 'last_name'])

# REORDER COLUMNS
df = df[['full_name', 'age', 'age_cat', 'Happy_Sad_group', 'Dosage', 'Drug', 
         'Mem_Score_Before', 'Mem_Score_After', 'Diff', 'diff_cat']]

<a id="exploration"></a>

# EXPLORATORY DATA ANALYSIS

In this section we visualize our data and see what insights may be gleaned. 

<a id="univariate"></a>

## UNIVARIATE ANALYSIS

In this section I look at some preliminary insights that the data has to offer. Using bar charts, histograms, pie charts, and box plots I observe how each variable is distributed.

**AGE & AGE_CAT**

The majority of patients fall into the young adult (18-35) and middle age (36-55) categories, with young adults slightly outnumbering middle age. Seniors occur far less frequently, with only 18 out of the 198 participants falling into that category.

**HAPPY/SAD GROUP**

The patients are evenly split between the two values of the "Happy_Sad_group" variable, which records whether each patient was primed with happy or sad memories 10 minutes before testing. 

**DOSAGE AND DRUG DISTRIBUTION**

There is a fairly even distribution of each type of drug and number of doses among the patient population. 

**MEMORY SCORES AND DIFFERENCE CATEGORIES**

Overall there is a general increase in memory score, as indicated by the box plots of the "Mem_Score_Before" and "Mem_Score_After" variables and by the counts of the "diff_cat" variable. The Before scores range from a minimum value of 27.2 to an upper fence value of 100 and a maximum outlier value of 110. The After scores range from a minimum value of 27.1 to an upper fence value of 108 and a maximum outlier value of 120.
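
For readers unfamiliar with box-plot fences, the sketch below shows the usual 1.5 × IQR convention on made-up scores. The values are hypothetical, and the quartile method used here (NumPy's default) may differ slightly from the one Plotly applies, so the numbers it produces for the real data need not match the plot exactly.

```python
import numpy as np

# Hypothetical scores, not taken from the dataset.
scores = np.array([27.2, 45.0, 50.5, 55.0, 58.3, 61.0, 66.4, 70.0, 110.0])

# Quartiles and interquartile range.
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr

# A fence is the most extreme observation that still lies inside the limit.
upper_fence = scores[scores <= upper_limit].max()
lower_fence = scores[scores >= lower_limit].min()

# Anything beyond the limits is drawn as an outlier point.
outliers = scores[(scores > upper_limit) | (scores < lower_limit)]

print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```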







In [None]:
fig = make_subplots(rows=2, cols=1)

fig.add_trace(go.Histogram(x=df['age'], name='AGE',xbins=dict(start=20, end=90, size=5)), 
              row=1, col=1)

fig.add_trace(go.Histogram(x=df['age_cat'], name='AGE CATEGORIES'), row=2, col=1)

fig.update_layout(height=1000, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1, 
                  title_text="AGE AND AGE CATEGORY COUNTS")
fig.show()

In [None]:
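# Without `values`, the pie counts the rows in each group; passing the index
# as values would sum row numbers instead of counting patients
fig = px.pie(df, names='Happy_Sad_group')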
fig.update_layout(title_text='Happy/Sad Distribution')
fig.show()

In [None]:
fig = px.histogram(df, x="Dosage")

fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1, 
                  title_text="DOSAGE DISTRIBUTION")

fig.show()

In [None]:
fig = px.histogram(df, x="Drug")

fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1, 
                  title_text="DRUG DISTRIBUTION")

fig.show()

In [None]:
y0 = df['Mem_Score_Before']
y1 = df['Mem_Score_After']

fig = go.Figure()

fig.add_trace(go.Box(y=y0, name='Before'))
fig.add_trace(go.Box(y=y1, name='After'))

fig.show()

In [None]:
fig = px.histogram(df, x="diff_cat")

fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1, 
                  title_text="DIFFERENCE CATEGORIES")

fig.show()

<a id="multivariate"></a>

## BIVARIATE & MULTIVARIATE ANALYSIS

In this section I take a more in-depth look at the variables and look for relationships between them. By performing bivariate and multivariate analyses we can determine which variables may have the most effect on the change in memory score. 

<a id="memscore"></a>

### MEMORY SCORE COMPARISONS

Aside from name, the only patient information that we have to work with is age. While it would be preferable to have another variable such as sex to provide more insight, age may give an indication of how memory score might be impacted in this study. 

In [None]:
# Use pd.melt to transform our dataframe and make it more usable for creating the following visualizations

df_melt = pd.melt(df, id_vars=['Happy_Sad_group', 'age', 'age_cat'], value_vars=['Mem_Score_Before', 'Mem_Score_After'])
df_melt.rename(columns={'variable':'Mem_Score'}, inplace=True)

In [None]:
fig = px.box(df_melt, x="age_cat", y='value', color='Mem_Score', points="all")
fig.update_layout(height=500, 
                  width=900, 
                  title_text="Mem Score Before & After by Age Category")
fig.show()



**OBSERVATIONS**

* Although there are far fewer samples in the "senior adult" age group, their "Mem_Score_Before" values are generally higher than those of the other two age categories. 
* The value distributions of the young adult and middle age categories are comparable.
* The young adults saw more of an increase in memory score than the other categories. 

In [None]:
x0=df_melt['Happy_Sad_group'].loc[df_melt['Mem_Score'] == 'Mem_Score_Before']
x1=df_melt['Happy_Sad_group'].loc[df_melt['Mem_Score'] == 'Mem_Score_After']

y0 = df_melt[['value']].loc[df_melt['Mem_Score'] == 'Mem_Score_Before']
y1 = df_melt[['value']].loc[df_melt['Mem_Score'] == 'Mem_Score_After']

fig = go.Figure()

fig.add_trace(go.Box(y=y0['value'], x=x0, name='Before', marker_size=3,  boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y1['value'], x=x1, name='After', marker_size=3, boxpoints="all", boxmean=True))

fig.update_layout(height=600, 
                  width=1000,
                  title_text='Mem_Score x Happy/Sad Group',
                  yaxis_title='Mem_Score',
                  boxmode='group',
                  yaxis=dict(autorange=True,
                             showgrid=True,
                             zeroline=True,
                             dtick=5,
                             gridcolor='rgb(255, 255, 255)',
                             gridwidth=1,
                             zerolinecolor='rgb(255, 255, 255)',
                             zerolinewidth=2,)
)

fig.show()

**OBSERVATIONS**

* The sad group experienced a larger increase in memory score than the happy group.

In [None]:
fig = px.box(df, x="Drug", y="Mem_Score_After", color='Drug', points="all")
fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1, 
                  title_text="Mem Score After vs Drug")
fig.show()

In [None]:
fig = px.scatter(df, x="Mem_Score_Before", y="Mem_Score_After", 
                 color="Drug", size='Dosage', 
                 template='plotly_dark')
fig.show()

In [None]:
drug_mem_avg = df.groupby(['Drug'])[['Mem_Score_Before', 'Mem_Score_After']].agg('mean')
drug_mem_avg.reset_index(inplace=True)
drug_mem_avg = pd.melt(drug_mem_avg, id_vars=['Drug'], value_vars=['Mem_Score_Before', 'Mem_Score_After'])

# groupby sorts the drugs alphabetically (A, S, T), so the labels must follow that order
drug = ['Drug A', 'Drug S', 'Drug T']
before = drug_mem_avg['value'].loc[drug_mem_avg['variable'] == 'Mem_Score_Before']
after = drug_mem_avg['value'].loc[drug_mem_avg['variable'] == 'Mem_Score_After']

fig = go.Figure(data=[go.Bar(name='Mem Before', x=drug, y=before, marker_color='mediumvioletred'),
                      go.Bar(name='Mem After', x=drug, y=after, marker_color='dodgerblue')])

# Change the bar mode
fig.update_layout(barmode='group', 
                  title_text="Mem Score Comparison by Drug Type")
fig.show()

In [None]:
x0=df['Drug'].loc[df['Dosage'] == 1]
x1=df['Drug'].loc[df['Dosage'] == 2]
x2=df['Drug'].loc[df['Dosage'] == 3]

y0 = df[['Mem_Score_Before']].loc[df['Dosage'] == 1]
y1 = df[['Mem_Score_Before']].loc[df['Dosage'] == 2]
y2 = df[['Mem_Score_Before']].loc[df['Dosage'] == 3]


fig = go.Figure()

fig.add_trace(go.Box(y=y0['Mem_Score_Before'], x=x0, name='1 Dose', marker_size=3,  boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y1['Mem_Score_Before'], x=x1, name='2 Doses', marker_size=3, boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y2['Mem_Score_Before'], x=x2, name='3 Doses', marker_size=3, boxpoints="all", boxmean=True))

fig.update_layout(height=500, 
                  width=1000,
                  title_text='Mem Score Before x Drug & Dosage',
                  yaxis_title='Mem Score Before',
                  boxmode='group',
                  yaxis=dict(autorange=True,
                             showgrid=True,
                             zeroline=True,
                             dtick=10,
                             gridcolor='rgb(255, 255, 255)',
                             gridwidth=1,
                             zerolinecolor='rgb(255, 255, 255)',
                             zerolinewidth=2,)
)

fig.show()

In [None]:
x0=df['Drug'].loc[df['Dosage'] == 1]
x1=df['Drug'].loc[df['Dosage'] == 2]
x2=df['Drug'].loc[df['Dosage'] == 3]

y0 = df[['Mem_Score_After']].loc[df['Dosage'] == 1]
y1 = df[['Mem_Score_After']].loc[df['Dosage'] == 2]
y2 = df[['Mem_Score_After']].loc[df['Dosage'] == 3]


fig = go.Figure()

fig.add_trace(go.Box(y=y0['Mem_Score_After'], x=x0, name='1 Dose', marker_size=3,  boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y1['Mem_Score_After'], x=x1, name='2 Doses', marker_size=3, boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y2['Mem_Score_After'], x=x2, name='3 Doses', marker_size=3, boxpoints="all", boxmean=True))

fig.update_layout(height=500, 
                  width=1000,
                  title_text='Mem Score After x Drug & Dosage',
                  yaxis_title='Mem Score After',
                  boxmode='group',
                  yaxis=dict(autorange=True,
                             showgrid=True,
                             zeroline=True,
                             dtick=10,
                             gridcolor='rgb(255, 255, 255)',
                             gridwidth=1,
                             zerolinecolor='rgb(255, 255, 255)',
                             zerolinewidth=2,)
)

fig.show()

#### **OBSERVATIONS**

* There is an overall positive trend in memory score as indicated by the scatter plot. 
* Drug A (Alprazolam) had the largest positive impact on memory score. The higher the dosage, the greater the increase in memory score. 
* This is in contrast to Drug T (Triazolam) and Drug S (sugar placebo), which had no discernible impact on memory score. 

A more generalized perspective from the bar chart above shows us that Drug A (Alprazolam) does indeed have a more positive impact on memory score than Drug T (Triazolam) and the sugar placebo, which both resulted in a decrease in average memory score.
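
The drug and dosage comparison can also be summarized as a small table of mean differences. This is a sketch on made-up rows (the numbers are hypothetical and only mimic the column names of the dataset); applied to `df`, the same `pivot_table` call would give one mean per drug and dosage level.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset.
toy = pd.DataFrame({
    "Drug":   ["A", "A", "A", "A", "S", "S", "T", "T"],
    "Dosage": [1, 1, 3, 3, 1, 3, 1, 3],
    "Diff":   [1.0, 3.0, 20.0, 24.0, -1.0, 1.0, 0.5, -0.5],
})

# Rows: drug, columns: dosage, cells: mean change in memory score.
mean_diff = toy.pivot_table(
    index="Drug", columns="Dosage", values="Diff", aggfunc="mean"
)

print(mean_diff)
```

A table like this makes a dose-response pattern visible at a glance: a row whose values grow from left to right indicates that higher doses go with larger changes.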

In [None]:
# Add histogram data
x0 = df['Mem_Score_Before'].loc[df['Happy_Sad_group'] == 'H']
x1 = df['Mem_Score_After'].loc[df['Happy_Sad_group'] == 'H']

fig = make_subplots(rows=1, cols=2)

# The scores are passed as x values, so the bins are set with xbins
trace0 = go.Histogram(x=x0, xbins=dict(start=20, end=120, size=10), name='Before')
trace1 = go.Histogram(x=x1, xbins=dict(start=20, end=120, size=10), name='After')

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)

fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1, 
                  title_text="Memory Score Before vs After of Happy Group")
fig.show()

In [None]:
# Add histogram data
x0 = df['Mem_Score_Before'].loc[df['Happy_Sad_group'] == 'S']
x1 = df['Mem_Score_After'].loc[df['Happy_Sad_group'] == 'S']

fig = make_subplots(rows=1, cols=2)

# The scores are passed as x values, so the bins are set with xbins
trace0 = go.Histogram(x=x0, xbins=dict(start=20, end=120, size=10), name='Before')
trace1 = go.Histogram(x=x1, xbins=dict(start=20, end=120, size=10), name='After')

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)

fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1, 
                  title_text="Memory Score Before vs After of Sad Group")
fig.show()

In [None]:
mem_score_avg = df.groupby(['age_cat', 'Drug'])[['Mem_Score_Before', 'Mem_Score_After']].agg('mean')
mem_score_avg.reset_index(inplace=True)
mem_score_avg.rename(columns={'Mem_Score_Before':'avg_mem_score_before', 
                              'Mem_Score_After':'avg_mem_score_after'}, inplace=True)
mem_score_avg = pd.melt(mem_score_avg, id_vars=['Drug', 'age_cat'], value_vars=['avg_mem_score_before', 'avg_mem_score_after'])

mem_score_avg.rename(columns={"variable":"avg_mem_score"}, inplace=True)
mem_score_avg.replace({'avg_mem_score_before': 'before', 'avg_mem_score_after':'after'}, inplace=True)
mem_score_avg

In [None]:
# Add histogram data
y0 = df['Mem_Score_Before'].loc[df['Drug'] == 'A']
y1 = df['Mem_Score_After'].loc[df['Drug'] == 'A']

fig = make_subplots(rows=1, cols=2)

# Use the larger maximum so that no "After" score falls outside the bins
binend = max(y0.max(), y1.max())

trace0 = go.Histogram(y=y0, ybins=dict(start=20, end=binend, size=10), name='Before')
trace1 = go.Histogram(y=y1, ybins=dict(start=20, end=binend, size=10), name='After')

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)

fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1, 
                  title_text="Memory Score Before vs After of Drug A")
fig.show()

In [None]:
# Add histogram data
y0 = df['Mem_Score_Before'].loc[df['Drug'] == 'S']
y1 = df['Mem_Score_After'].loc[df['Drug'] == 'S']

fig = make_subplots(rows=1, cols=2)

# Use the larger maximum so that no "After" score falls outside the bins
binend = max(y0.max(), y1.max())

trace0 = go.Histogram(y=y0, ybins=dict(start=20, end=binend, size=10), name='Before')
trace1 = go.Histogram(y=y1, ybins=dict(start=20, end=binend, size=10), name='After')

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)

fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1, 
                  title_text="Memory Score Before vs After of Drug S")
fig.show()

In [None]:
# Add histogram data
y0 = df['Mem_Score_Before'].loc[df['Drug'] == 'T']
y1 = df['Mem_Score_After'].loc[df['Drug'] == 'T']

fig = make_subplots(rows=1, cols=2)

# Use the larger maximum so that no "After" score falls outside the bins
binend = max(y0.max(), y1.max())

trace0 = go.Histogram(y=y0, ybins=dict(start=20, end=binend, size=10), name='Before')
trace1 = go.Histogram(y=y1, ybins=dict(start=20, end=binend, size=10), name='After')

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)

fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1, 
                  title_text="Memory Score Before vs After of Drug T")
fig.show()

<a id="diff"></a>

### DIFFERENCE COMPARISONS

In this section we'll perform an exploratory analysis of the values in the **Diff** column and relate them to the other variables. The results should reflect the findings from the analyses above, but they provide a different perspective on the data. 

In [None]:
fig = px.scatter(df, x="age", y="Diff", 
                 color="age_cat", 
                 template='plotly_dark')
fig.update_layout(title_text="Diff vs Age")
fig.show()

In [None]:
y0 = df[['Diff']].loc[df['age_cat'] == 'young adult']
y1 = df[['Diff']].loc[df['age_cat'] == 'middle age']
y2 = df[['Diff']].loc[df['age_cat'] == 'senior adult']


fig = go.Figure()

fig.add_trace(go.Box(y=y0['Diff'], name='Young Adult', boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y1['Diff'], name='Middle Age', boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y2['Diff'], name='Senior Adult', boxpoints="all", boxmean=True))

fig.update_layout(height=600,
                  width=1000,
                  title_text="Diff x Age Category",
                  yaxis=dict(autorange=True,
                             showgrid=True,
                             zeroline=True,
                             dtick=5,
                             gridcolor='rgb(255, 255, 255)',
                             gridwidth=1,
                             zerolinecolor='rgb(255, 255, 255)',
                             zerolinewidth=2,)                 
                 )

fig.show()

#### **OBSERVATIONS**

With the exception of a few outliers in the Young Adult age category, age doesn't seem to have a significant impact on the difference in memory score.
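
A visual impression like this can be backed by a correlation coefficient. The sketch below computes the Pearson correlation between age and Diff on made-up values (hypothetical numbers, not the dataset); a coefficient close to zero would be consistent with age having little linear relationship to the change in score.

```python
import pandas as pd

# Hypothetical ages and score differences, not taken from the dataset.
toy = pd.DataFrame({
    "age":  [20, 30, 40, 50, 60],
    "Diff": [2.0, -1.0, 3.0, -2.0, 1.0],
})

# Pearson correlation between the two columns (pandas default method).
r = toy["age"].corr(toy["Diff"])

print(f"Pearson r between age and Diff: {r:.3f}")
```

Note that a low Pearson coefficient only rules out a linear trend; the box plots by age category remain useful for spotting other patterns.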

In [None]:
y0 = df[['Diff']].loc[df['Drug'] == 'A']
y1 = df[['Diff']].loc[df['Drug'] == 'T']
y2 = df[['Diff']].loc[df['Drug'] == 'S']

fig = go.Figure()

fig.add_trace(go.Box(y=y0['Diff'], name='A: Alprazolam', boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y1['Diff'], name='T: Triazolam', boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y2['Diff'], name='S: Sugar', boxpoints="all", boxmean=True))

fig.update_layout(height=600, 
                  width=1000, 
                  title_text="Diff x Drug",
                  yaxis=dict(autorange=True,
                             showgrid=True,
                             zeroline=True,
                             dtick=5,
                             gridcolor='rgb(255, 255, 255)',
                             gridwidth=1,
                             zerolinecolor='rgb(255, 255, 255)',
                             zerolinewidth=2)
                 )
fig.show()

In [None]:
y0 = df[['Diff']].loc[df['Happy_Sad_group'] == 'H']
y1 = df[['Diff']].loc[df['Happy_Sad_group'] == 'S']

fig = go.Figure()

fig.add_trace(go.Box(y=y0['Diff'], name='Happy', boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y1['Diff'], name='Sad', boxpoints="all", boxmean=True))

fig.update_layout(height=600, 
                  width=1000, 
                  title_text="Diff x Happy/Sad Group",
                  yaxis=dict(autorange=True,
                             showgrid=True,
                             zeroline=True,
                             dtick=5,
                             gridcolor='rgb(255, 255, 255)',
                             gridwidth=1,
                             zerolinecolor='rgb(255, 255, 255)',
                             zerolinewidth=2)
                 )

fig.show()

In [None]:
x0=df['Drug'].loc[df['age_cat'] == 'young adult']
x1=df['Drug'].loc[df['age_cat'] == 'middle age']
x2=df['Drug'].loc[df['age_cat'] == 'senior adult']

y0 = df[['Diff']].loc[df['age_cat'] == 'young adult']
y1 = df[['Diff']].loc[df['age_cat'] == 'middle age']
y2 = df[['Diff']].loc[df['age_cat'] == 'senior adult']


fig = go.Figure()

fig.add_trace(go.Box(y=y0['Diff'], x=x0, name='young adult', marker_size=3,  boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y1['Diff'], x=x1, name='middle age', marker_size=3, boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y2['Diff'], x=x2, name='senior adult', marker_size=3, boxpoints="all", boxmean=True))

fig.update_layout(height=600, 
                  width=1000,
                  title_text='Diff x Drug & Age Category',
                  yaxis_title='Diff',
                  boxmode='group',
                  yaxis=dict(autorange=True,
                             showgrid=True,
                             zeroline=True,
                             dtick=5,
                             gridcolor='rgb(255, 255, 255)',
                             gridwidth=1,
                             zerolinecolor='rgb(255, 255, 255)',
                             zerolinewidth=2,)
)

fig.show()

### MEMORY SCORE DIFFERENCE BY DRUG & DOSAGE

In [None]:
x0=df['Drug'].loc[df['Dosage'] == 1]
x1=df['Drug'].loc[df['Dosage'] == 2]
x2=df['Drug'].loc[df['Dosage'] == 3]

y0 = df[['Diff']].loc[df['Dosage'] == 1]
y1 = df[['Diff']].loc[df['Dosage'] == 2]
y2 = df[['Diff']].loc[df['Dosage'] == 3]


fig = go.Figure()

fig.add_trace(go.Box(y=y0['Diff'], x=x0, name='1 Dose', marker_size=3,  boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y1['Diff'], x=x1, name='2 Doses', marker_size=3, boxpoints="all", boxmean=True))
fig.add_trace(go.Box(y=y2['Diff'], x=x2, name='3 Doses', marker_size=3, boxpoints="all", boxmean=True))

fig.update_layout(height=500, 
                  width=1000,
                  title_text='Diff x Drug & Dosage',
                  yaxis_title='Diff',
                  boxmode='group',
                  yaxis=dict(autorange=True,
                             showgrid=True,
                             zeroline=True,
                             dtick=10,
                             gridcolor='rgb(255, 255, 255)',
                             gridwidth=1,
                             zerolinecolor='rgb(255, 255, 255)',
                             zerolinewidth=2,)
)

fig.show()

### HIGHEST MEMORY SCORE DIFFERENCES

In this section we look at the samples containing both the 10 highest and 10 lowest values in the Diff column to determine which drug is associated with each. 

For the highest values, 9 of the 10 largest memory score differences belonged to Alprazolam, with the remaining one belonging to Triazolam.

For the lowest values there was more of a mixture, with 4 belonging to Sugar, 4 belonging to Triazolam, and 2 belonging to Alprazolam. 
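
The same tally can be produced directly with `nlargest`, `nsmallest`, and `value_counts`. This is a sketch on a made-up frame with hypothetical participants `P1` to `P8`, using the top and bottom 3 rows instead of 10 to keep it short.

```python
import pandas as pd

# Hypothetical participants and values, not taken from the dataset.
toy = pd.DataFrame({
    "full_name": ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"],
    "Drug":      ["A", "A", "T", "S", "S", "T", "A", "T"],
    "Diff":      [30.0, 25.5, 12.0, -8.0, -5.5, -10.0, 2.0, 0.0],
})

# Rows with the largest and smallest differences.
top = toy.nlargest(3, "Diff")
bottom = toy.nsmallest(3, "Diff")

# How often each drug appears among them.
top_counts = top["Drug"].value_counts()
bottom_counts = bottom["Drug"].value_counts()

print(top_counts)
print(bottom_counts)
```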


In [None]:
# CREATE DATAFRAME CONTAINING HIGHEST 10 VALUES OF 'DIFF' COLUMN

top_10_diff = df.sort_values('Diff', ascending=False)[:10]
top_10_diff.sort_values('Diff', ascending=False, inplace=True)
# top_10_diff

In [None]:
fig = px.bar(top_10_diff, x='Diff', y='full_name', color="Drug",
             title='10 Patients with Greatest Mem Score Increase', 
             text='Diff', orientation='h', hover_data=["age_cat", "Dosage", 'Happy_Sad_group'])

fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1,
                  yaxis={'categoryorder':'total ascending'}
                 )
fig.show()


In [None]:
low_10_diff = df.sort_values('Diff', ascending=True)[:10]
low_10_diff.sort_values('Diff', ascending=True, inplace=True)
# low_10_diff

In [None]:
fig = px.bar(low_10_diff, x='Diff', y='full_name', color='Drug',
             title='10 Patients with Greatest Mem Score Decrease', 
             text='Diff', orientation='h',
             hover_data=["age_cat", "Dosage", 'Happy_Sad_group'])

fig.update_layout(height=500, 
                  width=800, 
                  bargap=0.2, 
                  bargroupgap=0.1,
                  yaxis={'categoryorder':'total ascending'}
                  )
fig.show()


<a id="diffcat"></a>

### DIFFERENCE CATEGORY ANALYSIS

Earlier in this notebook the values in the "Diff" column were separated into three different categories: **Decrease** for the values that had a negative difference, **Increase** for the values that had a positive difference, **No Change** for the values that remained the same. 

In this section I visualize the number of samples belonging to each of those categories according to their drug type, Happy/Sad group, dosage amount, and age category.
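
The percentages behind such charts can be tabulated with `pd.crosstab`. The sketch below uses made-up rows (hypothetical values with the dataset's column names) and normalizes each row to 100%, so every group gets an explicit entry for each category, including a zero where a category does not occur.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset.
toy = pd.DataFrame({
    "Drug": ["A", "A", "A", "A", "S", "S", "T", "T"],
    "diff_cat": ["increase", "increase", "increase", "decrease",
                 "decrease", "increase", "decrease", "no change"],
})

# Share of each difference category within each drug, in percent.
share = pd.crosstab(toy["Drug"], toy["diff_cat"], normalize="index") * 100

# Keep a fixed column order; missing categories are filled with 0.
share = share.reindex(columns=["decrease", "increase", "no change"], fill_value=0)

print(share.round(1))
```

Having one value per category for every group also keeps values and labels aligned when they are passed to a chart.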

In [None]:
diff_cat_count = df.groupby(['diff_cat', 'Drug'])[['Diff']].agg('count')
diff_cat_count.reset_index(inplace=True)
diff_cat_count.rename(columns={'Diff':'count'}, inplace=True)

labels = ["decrease", "increase", "no change"]
pie0 = diff_cat_count['count'].loc[diff_cat_count['Drug'] == 'A']
pie1 = diff_cat_count['count'].loc[diff_cat_count['Drug'] == 'T']
pie2 = diff_cat_count['count'].loc[diff_cat_count['Drug'] == 'S']

fig = make_subplots(rows=1, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=['Alprazolam', 'Triazolam', 'Sugar'])

fig.add_trace(go.Pie(labels=labels, values=pie0, name="Alprazolam"),
              1, 1)
fig.add_trace(go.Pie(labels=labels, values=pie1, name="Triazolam"),
              1, 2)
fig.add_trace(go.Pie(labels=labels, values=pie2, name="Sugar"),
              1, 3)

fig.update_traces(hoverinfo="label+name+value")

fig.update_layout(title_text="Diff Category According to Drug Type")

fig.show()

**OBSERVATIONS**

Alprazolam has the most positive impact on memory score with 70.1% of patients gaining an increase in memory score. 

In [None]:
diff_cat_hsg = df.groupby(['diff_cat', 'Happy_Sad_group'])[['Diff']].agg('count')
diff_cat_hsg.reset_index(inplace=True)
diff_cat_hsg.rename(columns={'Diff':'count'}, inplace=True)
diff_cat_hsg

labels = ["decrease", "increase", "no change"]
pie0 = diff_cat_hsg['count'].loc[diff_cat_hsg['Happy_Sad_group'] == 'H']
pie1 = diff_cat_hsg['count'].loc[diff_cat_hsg['Happy_Sad_group'] == 'S']

fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=['Happy', 'Sad'])

fig.add_trace(go.Pie(labels=labels, values=pie0, name="Happy"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=pie1, name="Sad"), 1, 2)

fig.update_traces(hoverinfo="label+name+value")

fig.update_layout(title_text="Diff Category According to Happy/Sad Group")

fig.show()

**OBSERVATIONS**

Although patients primed with sad memories show an increase in memory score at a rate approximately 2% greater than patients primed with happy memories, there does not appear to be a meaningful difference in how happy or sad memories affect memory score. 

In [None]:
diff_cat_dose = df.groupby(['diff_cat', 'Dosage'])[['Diff']].agg('count')
diff_cat_dose.reset_index(inplace=True)
diff_cat_dose.rename(columns={'Diff':'count'}, inplace=True)
diff_cat_dose

labels = ["decrease", "increase", "no change"]
pie0 = diff_cat_dose['count'].loc[diff_cat_dose['Dosage'] == 1]
pie1 = diff_cat_dose['count'].loc[diff_cat_dose['Dosage'] == 2]
pie2 = diff_cat_dose['count'].loc[diff_cat_dose['Dosage'] == 3]

fig = make_subplots(rows=1, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=['1 Dose', '2 Doses', '3 Doses'])

fig.add_trace(go.Pie(labels=labels, values=pie0, name="1 Dose"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=pie1, name="2 Doses"), 1, 2)
fig.add_trace(go.Pie(labels=labels, values=pie2, name="3 Doses"), 1, 3)

fig.update_traces(hoverinfo="label+name+value")

fig.update_layout(title_text="Diff Category According to Dosage")

fig.show()

**OBSERVATIONS**

On average, the patients who received a higher drug dosage reported higher increases in memory score. 

In [None]:
diff_cat_age = df.groupby(['diff_cat', 'age_cat'])[['Diff']].agg('count')
diff_cat_age.reset_index(inplace=True)
diff_cat_age.rename(columns={'Diff':'count'}, inplace=True)

labels = ["decrease", "increase", "no change"]
pie0 = diff_cat_age['count'].loc[diff_cat_age['age_cat'] == 'young adult']
pie1 = diff_cat_age['count'].loc[diff_cat_age['age_cat'] == 'middle age']
pie2 = diff_cat_age['count'].loc[diff_cat_age['age_cat'] == 'senior adult']

fig = make_subplots(rows=1, cols=3, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'}]], 
                    subplot_titles=['Young Adult', 'Middle Age', 'Senior Adult'])

fig.add_trace(go.Pie(labels=labels, values=pie0, name="Young Adult"), 1, 1)
fig.add_trace(go.Pie(labels=labels, values=pie1, name="Middle Age"), 1, 2)
fig.add_trace(go.Pie(labels=labels, values=pie2, name="Senior"), 1, 3)

fig.update_traces(hoverinfo="label+name+value")

fig.update_layout(title_text="Diff Category According to Age Category")

fig.show()

In [None]:
diff_cat_age

**OBSERVATIONS**

On average, middle age patients responded more favorably to this study with a 5% to 6% higher rate of increased memory score. 

<a id="features"></a>

## ADDITIONAL FEATURE ENGINEERING

Let's use scikit-learn's LabelEncoder to transform some of the categorical variables into numerical values so that we can run our clustering algorithms. 
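
To see which integer each category receives, the fitted encoder's `classes_` attribute can be inspected. This is a minimal sketch on a made-up series of drug codes (the series is hypothetical; the encoder calls are standard scikit-learn).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical drug codes, not taken from the dataset.
drug = pd.Series(["T", "A", "S", "A", "T"])

le = LabelEncoder()
codes = le.fit_transform(drug)

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}

print(mapping)
print(codes)
```

Because the classes are sorted, the codes carry no meaning beyond identifying the category; they do not imply an ordering of the drugs.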

In [None]:
from sklearn.preprocessing import LabelEncoder

df1 = pd.read_csv('../input/memory-test-on-drugged-islanders-data/Islander_data.csv')

# Happy Sad group: H = 0, S = 1
le = LabelEncoder()
le.fit(df1.Happy_Sad_group.drop_duplicates()) 
df1.Happy_Sad_group = le.transform(df1.Happy_Sad_group)

# Drug: A=0, S=1, T=2
le.fit(df1.Drug.drop_duplicates()) 
df1.Drug = le.transform(df1.Drug)


<a id="clustering"></a>

## CLUSTERING

<a id="kmeans"></a>

### K-MEANS CLUSTERING

In [None]:
from sklearn.cluster import KMeans 
from sklearn.preprocessing import StandardScaler

In [None]:
X = df1[['age', 'Happy_Sad_group', 'Dosage', 'Drug', 'Mem_Score_Before', 'Mem_Score_After', 'Diff']]

In [None]:
X_clus = StandardScaler().fit_transform(X)
X_clus

In [None]:
clusterNum = 3
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
k_means.fit(X_clus)
labels = k_means.labels_
print(labels)

In [None]:
df['cluster'] = labels
df.head()

In [None]:
cluster_centers = k_means.cluster_centers_
cluster_centers

In [None]:
cluster0 = df.loc[df['cluster'] == 0]
cluster1 = df.loc[df['cluster'] == 1]
cluster2 = df.loc[df['cluster'] == 2]

In [None]:
fig = px.scatter(df, x='Happy_Sad_group', y='Mem_Score_Before', color='cluster')
fig.update_layout(title='Memory Score Before Distribution by Happy/Sad Group')
fig.show()

<a id="hierarchical"></a>

### HIERARCHICAL CLUSTERING

In [None]:
from scipy import ndimage 
from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 
from matplotlib import pyplot as plt 
from sklearn import manifold, datasets 
from sklearn.cluster import AgglomerativeClustering 
import pylab

In [None]:
df2 = pd.read_csv('../input/memory-test-on-drugged-islanders-data/Islander_data.csv')

In [None]:
agglom = AgglomerativeClustering(n_clusters = 3, linkage = 'complete')
agglom.fit(X_clus)
agglom.labels_

In [None]:
dist_matrix = distance_matrix(X_clus,X_clus) 
print(dist_matrix)

In [None]:
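# linkage expects the observations themselves (or a condensed distance vector),
# not a square distance matrix
Z = hierarchy.linkage(X_clus, 'complete')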

In [None]:
df2['cluster'] = agglom.labels_
df2.head()

In [None]:
fig = pylab.figure(figsize=(18,50))
def llf(id):
    return '[%s %s %s %s]' % (df2['first_name'][id], df2['last_name'][id], df2['Happy_Sad_group'][id], df2['Mem_Score_Before'][id])
    
dendro = hierarchy.dendrogram(Z,  leaf_label_func=llf, leaf_rotation=0, leaf_font_size=12, orientation = 'right')