## Data Cleaning and Understanding
We begin by importing all the necessary libraries that we will be using throughout the data-cleaning process. This includes libraries for data manipulation, natural language processing, and language detection.

In [2]:
import pandas as pd
import csv
import sys
import matplotlib.pyplot as plt
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize
import re
from langdetect import detect
from langdetect import DetectorFactory
DetectorFactory.seed = 0  # For reproducible results

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\legac\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\legac\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\legac\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Next, we load the dataset, specifying the columns explicitly because of how the data was initially collated. It contains the title, drug, dosage, delivery method, weight, year, gender, and report for each entry. We'll take an initial look at the information about our data.


In [3]:
df = pd.read_csv('D:/Cloud/Google Drive/Colab Notebooks/Data/Full.csv', 
                 encoding='ISO-8859-1',
                 usecols=['title', 'drug', 'dosage', 'delivery', 'weight', 'year', 'gender', 'report'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39244 entries, 0 to 39243
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     39131 non-null  object
 1   drug      39126 non-null  object
 2   dosage    35497 non-null  object
 3   delivery  37851 non-null  object
 4   weight    39124 non-null  object
 5   year      39124 non-null  object
 6   gender    39123 non-null  object
 7   report    39121 non-null  object
dtypes: object(8)
memory usage: 2.4+ MB
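
Since every column comes back as object and 'year' shows up as a float, one option is to read everything as text and convert the numeric columns explicitly. The sketch below uses a small in-memory CSV as a stand-in for the real file (the sample rows are made up for illustration) and is only one way this could be handled:

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV; the rows are illustrative only
raw = io.StringIO(
    "title,drug,weight,year,extra\n"
    "Ode to Joy,MDMA,185.0,2000.0,x\n"
    "Broken row,,abc,,y\n"
)

# Read only the columns we need, all as text, so pandas never guesses types
sample = pd.read_csv(raw, usecols=["title", "drug", "weight", "year"], dtype=str)

# Convert explicitly; anything unparseable becomes a missing value
sample["weight"] = pd.to_numeric(sample["weight"], errors="coerce")
sample["year"] = pd.to_numeric(sample["year"], errors="coerce").astype("Int64")

print(sample.dtypes)
```

Using the nullable Int64 dtype lets 'year' stay an integer even when some rows have no year at all.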




In [4]:
df.head()

Unnamed: 0,title,drug,dosage,delivery,weight,year,gender,report
0,Ode to Joy,MDMA,1.5 tablets,oral,185.0,2000.0,male,My friend had some experience with X and had t...
1,Make Sure the Music's Not Too Complex,Cannabis,,smoked,0.0,1999.0,not specified,This was the first experience that either my f...
2,After Hours,"MDMA, MDMA, MDMA","160 mg, 100 mg, 50 mg","oral, oral, insufflated",150.0,2001.0,male,Preparation: I have heard some conflicting opi...
3,Heavy Hallucinogenic Expreince,"25C-NBOMe, Alcohol - Beer/Wine","1 hit, 2 glasses","sublingual, oral",165.0,2012.0,male,00:00 - Just dropped tab alone. Holding under ...
4,Love and the Velvet Underground,DMT,100 mg,smoked,170.0,2013.0,male,Setting: my apartmentDate: 4/24/2013I lit up t...


The first thing we notice is the 'title' column, which doesn't seem to have any real value. It also looks like there are some NaNs in the dosage column and some 0s in the weight column, and the year is stored as a float instead of an integer. For this project, our two most important columns will be 'drug', which records which drug was used, and 'report', which contains the qualitative story data. Let's take a closer look at their values.

In [32]:
len(df['drug'].unique())

11718

That's a LOT of different drug types! As it turns out, there are sometimes multiple drugs listed per row, so we will have to find a way to separate them into individual entries. That should reduce our number of unique values. Now let's take a look at the report column to see what we are working with:

In [7]:
lengths = df['report'].str.len()
lengths.describe()

count    39121.000000
mean      4849.531505
std       4447.779060
min          8.000000
25%       1944.000000
50%       3586.000000
75%       6201.000000
max      32759.000000
Name: report, dtype: float64

The smallest report is 8 characters long, which seems off, and the largest is over 32k characters (str.len() counts characters, not words). So lots of variation! Those larger reports are going to be harder to manage, as even the largest LLMs have token limits (the amount of text they can process in a single input). Given the size and scope of our data, we are going to have to find ways to trim the fat so that our processing can be lean and focus primarily on what we want to reproduce: accurate, human-sounding trip reports of psychedelics.
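
Because str.len() measures characters, getting a word count takes one extra split. A minimal sketch on made-up reports, showing both measures side by side:

```python
import pandas as pd

reports = pd.Series([
    "Too short",
    "This report has exactly six words.",
    None,
])

char_lengths = reports.str.len()              # characters per report
word_counts = reports.str.split().str.len()   # whitespace-separated words

print(pd.DataFrame({"chars": char_lengths, "words": word_counts}))
```

Missing reports stay missing in both columns rather than being counted as zero.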

Let's look at what some of these reports consist of:

In [20]:
df['report'].sample(20)

23065    The whole situation started last year, I decid...
21386    My doctor suggested I try this for mild anxiet...
16550    I had taken this substance by insufflation, ca...
30668    I personally think chamomile is an awesome her...
20112    Just recently while visiting friends during th...
18751    Up to this experience, I've tried mushrooms an...
3872     Ayahuasca in the Jungles of PeruAfter the cere...
10744    I really had nothing to do one night, and my f...
20756    The other night a friend I met a friend of min...
24193    About 9 months ago I blew out a disc in my bac...
18106    I initially woke up with the thought of taking...
19950    Sitting in the park with a few of my girlfrien...
19450    One day I read that nutmeg was usable as a dru...
21099    I probably only write about 1/4 of my trips or...
34769    I ate mushrooms while visiting a friend in the...
22784    I went to the dentist to get a filling replace...
25010    Calea has been sat in my store cupboard for ov.

It looks like there are some non-English reports. We will want to drop those as part of our processing, in addition to lowercasing the text and removing some of the symbols and notation.

In [21]:
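def detect_language(text):
    # langdetect raises on empty or non-linguistic input, and missing (NaN)
    # reports are not strings; treat all of those as "no language detected"
    try:
        return detect(text)
    except Exception:
        return None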
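
Very short or missing reports give a detector little to work with, so it can help to skip them before calling it. A minimal sketch, assuming a hypothetical wrapper safe_detect and an arbitrary min_chars threshold of 20; toy_detector is a stand-in so the example runs without langdetect, and in the notebook the detect function would be passed instead:

```python
def safe_detect(text, detector, min_chars=20):
    """Return a language code, or None when detection is not meaningful."""
    # Missing reports arrive as NaN (a float), not as strings
    if not isinstance(text, str) or len(text.strip()) < min_chars:
        return None
    try:
        return detector(text)
    except Exception:
        return None


def toy_detector(text):
    """Placeholder detector for the demo; not a real language model."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no letters to analyse")
    return "en"


print(safe_detect("My friend had some experience with this before.", toy_detector))
print(safe_detect(float("nan"), toy_detector))
print(safe_detect("1234567890 1234567890 1234567890", toy_detector))
```

Only the first call reaches the detector successfully; the other two return None.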

In [22]:
# Record the original number of rows
original_count = df.shape[0]

# Detect language and filter the rows
df['language'] = df['report'].apply(detect_language)
df = df[df['language'] == 'en']
df = df.drop('language', axis=1)

# Calculate the number of rows that were dropped
dropped_count = original_count - df.shape[0]

print(f"{dropped_count} rows were dropped. The DataFrame now has {df.shape[0]} rows.")

192 rows were dropped. The DataFrame now has 39052 rows.


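Next we want to clean up our rows and columns. We won't need the title or year columns, so we can drop those. We also know that the drug, delivery, gender, and report columns hold strings and that weight is numeric, so let's adjust their types. Note that astype(str) turns missing values into the literal string 'nan', which we handle later when categorizing drugs.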

In [23]:
# Drop the 'year' and 'title' columns
df = df.drop(['year', 'title'], axis=1)

# Convert columns to desired data types
df['drug'] = df['drug'].astype(str)
df['delivery'] = df['delivery'].astype(str)
df['gender'] = df['gender'].astype(str)
df['report'] = df['report'].astype(str)

# Convert 'weight' column to float with error handling
df['weight'] = pd.to_numeric(df['weight'], errors='coerce')

# Replace NaN values with 0
df['weight'] = np.nan_to_num(df['weight'], nan=0)

# Convert 'weight' column to rounded integers
df['weight'] = np.round(df['weight']).astype('Int64')

# Verify the updated data types using df.info()
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 39052 entries, 0 to 39243
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   drug      39052 non-null  object
 1   dosage    35427 non-null  object
 2   delivery  39052 non-null  object
 3   weight    39052 non-null  Int64 
 4   gender    39052 non-null  object
 5   report    39052 non-null  object
dtypes: Int64(1), object(5)
memory usage: 2.1+ MB
None


We still have some missing values in the dosage column (3,625 of them). Let's fill them with 'unknown' and also replace the '0' weights with the average weight.

In [24]:
# Replace 'null' values in 'dosage' column with 'unknown'
df['dosage'] = df['dosage'].fillna('unknown')

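# Calculate the average weight (the 0 placeholders are still included in this mean)
avg_weight = int(np.round(df['weight'].mean()))

# Replace '0' values in 'weight' column with the average weight
# (np.where returns a plain array, so the nullable Int64 dtype is not preserved)
df['weight'] = np.where(df['weight'] == 0, avg_weight, df['weight'])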

# Verify the updated data using df.info()
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 39052 entries, 0 to 39243
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   drug      39052 non-null  object
 1   dosage    39052 non-null  object
 2   delivery  39052 non-null  object
 3   weight    39052 non-null  object
 4   gender    39052 non-null  object
 5   report    39052 non-null  object
dtypes: object(6)
memory usage: 2.1+ MB
None
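
One caveat with the cell above: the mean is taken while the 0 placeholders are still in the column, which pulls the average down. A possible variant, shown on made-up weights, averages only the known values and uses Series.mask so the nullable Int64 dtype survives:

```python
import pandas as pd

# Made-up weights; 0 stands for "not recorded"
weights = pd.Series([185, 0, 150, 165, 0], dtype="Int64")

# Mean over all rows, placeholders included
naive_mean = weights.mean()

# Mean over recorded weights only
known = weights[weights != 0]
avg_known = int(round(known.mean()))

# Swap the placeholders for the average without changing the dtype
filled = weights.mask(weights == 0, avg_known)

print(naive_mean, avg_known)
print(filled)
```

Applying this to the real data would change the imputed value, so the numbers shown in the outputs here would no longer match.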


That looks better! Now let's preprocess our reports so that we can use them in our classification models later on. We'll start by converting them to lowercase, removing non-alphanumeric characters, and lemmatizing (reducing each word to its dictionary form).

In [25]:
def preprocess_text(text):
    text = text.lower()
    sentences = sent_tokenize(text)
    words = [word_tokenize(sentence) for sentence in sentences]
    words = [re.sub(r'\W', '', word) for sentence in words for word in sentence]
    words = [lemmatizer.lemmatize(word) for word in words if word]
    return words

df['processed_report'] = df['report'].apply(preprocess_text)
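
To see what the regular-expression step does to individual tokens, here is a small standalone sketch on made-up tokens (no NLTK required). Note that \W leaves underscores in place and that stripping punctuation fuses things like dates into a single run of digits:

```python
import re

tokens = ["Don't", "4/24/2013", "--", "snake_case", "trip"]

# \W matches anything that is not a letter, digit, or underscore
cleaned = [re.sub(r"\W", "", token.lower()) for token in tokens]
# Tokens that were pure punctuation are now empty strings; drop them
cleaned = [token for token in cleaned if token]

print(cleaned)
```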

## Data Engineering

With this preprocessing done, let's start doing some data engineering. We begin by creating a new dataframe so that if we make mistakes, we can reset to our cleaned data.

In [26]:
new_df = df.copy()

new_df.head()

Unnamed: 0,drug,dosage,delivery,weight,gender,report,processed_report
0,MDMA,1.5 tablets,oral,185,male,My friend had some experience with X and had t...,"[[my, friend, had, some, experience, with, x, ..."
1,Cannabis,unknown,smoked,152,not specified,This was the first experience that either my f...,"[[this, wa, the, first, experience, that, eith..."
2,"MDMA, MDMA, MDMA","160 mg, 100 mg, 50 mg","oral, oral, insufflated",150,male,Preparation: I have heard some conflicting opi...,"[[preparation, , i, have, heard, some, conflic..."
3,"25C-NBOMe, Alcohol - Beer/Wine","1 hit, 2 glasses","sublingual, oral",165,male,00:00 - Just dropped tab alone. Holding under ...,"[[0000, , just, dropped, tab, alone, ], [holdi..."
4,DMT,100 mg,smoked,170,male,Setting: my apartmentDate: 4/24/2013I lit up t...,"[[setting, , my, apartmentdate, , 4242013i, li..."


Let's start by dropping rows with fewer than 50 words so that our future GPT-2 model has quality reports to learn from. This also has the added benefit of removing some of the odd rows that turned up during random sampling, which held chunks of reports without any other data.

In [46]:
new_df = new_df[new_df['report'].apply(lambda x: len(str(x).split()) >= 50)]

We'll start our engineering by focusing on the 'drug' column. Just like our reports, we will lowercase first. We also want to create a column that indicates whether a drug report relies on mixing drugs. This way we can start thinking about how we might want to weight our data so that reports that only rely on one drug type are given more weight. 

In [27]:
#level out the drug names for sorting
new_df['drug'] = new_df['drug'].str.lower()

# Create the 'mixed' column
new_df['mixed'] = new_df['drug'].str.strip().str.contains(',')
new_df['mixed'] = new_df['mixed'].fillna(False).astype(int)
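
The same flag can be built in one step by letting str.contains handle missing values through its na argument. A small sketch on made-up values:

```python
import pandas as pd

drugs = pd.Series(["mdma", "25c-nbome, alcohol - beer/wine", None])

# regex=False treats ',' as a literal; na=False marks missing entries as unmixed
mixed = drugs.str.contains(",", regex=False, na=False).astype(int)

print(mixed.tolist())
```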

Because we want to isolate drug types, having multiple drugs in one cell doesn't give us the information we need. So we will explode the rows so that each contains only one drug, keeps the rest of the information the same, and ties each drug to its corresponding dosage and delivery. The result is that our 'mixed' reports will end up with more rows, so we will need to balance that out against our unmixed rows later.

In [29]:
def explode_dataframe(new_df):
    # Create a copy of the DataFrame
    exploded_df = new_df.copy()

    # Convert the desired columns to lists
    exploded_df['drug'] = exploded_df['drug'].str.split(',')
    exploded_df['dosage'] = exploded_df['dosage'].str.split(',')
    exploded_df['delivery'] = exploded_df['delivery'].str.split(',')

    # Prepare an empty list to store new rows
    new_rows = []

    # Iterate over each row in the DataFrame
    for _, row in exploded_df.iterrows():
        drugs = row['drug']
        dosages = row['dosage']
        deliveries = row['delivery']

        # The three lists are not always the same length; zip() below stops
        # at the shortest one, so unmatched trailing entries are dropped
        # rather than padded with 'unknown'.

        # Create new rows by matching the corresponding elements
        for drug, dosage, delivery in zip(drugs, dosages, deliveries):
            new_row = row.copy()
            new_row['drug'] = drug
            new_row['dosage'] = dosage
            new_row['delivery'] = delivery
            new_rows.append(new_row)

    # Concatenate all new rows into a new DataFrame
    exploded_df = pd.DataFrame(new_rows)

    # Reset the index to reflect the new rows
    exploded_df.reset_index(drop=True, inplace=True)

    return exploded_df

new_df = explode_dataframe(new_df)
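
When a row lists more drugs than dosages or delivery methods, plain zip() stops at the shortest list. If keeping every drug matters more than matching the row counts shown here, itertools.zip_longest can pad the gaps instead. A sketch with a hypothetical helper split_aligned, on made-up inputs:

```python
from itertools import zip_longest


def split_aligned(drug, dosage, delivery, fill="unknown"):
    """Split three comma-separated fields and pair them up position by position."""
    parts = [
        [piece.strip() or fill for piece in field.split(",")]
        for field in (drug, dosage, delivery)
    ]
    # zip_longest pads the shorter lists with `fill` instead of truncating
    return list(zip_longest(*parts, fillvalue=fill))


rows = split_aligned("mdma, cannabis", "160 mg", "oral, smoked")
print(rows)
```

Stripping each piece also removes the leading space left behind by splitting on a bare comma.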

In [30]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76922 entries, 0 to 76921
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   drug              76922 non-null  object
 1   dosage            76922 non-null  object
 2   delivery          76922 non-null  object
 3   weight            76922 non-null  int64 
 4   gender            76922 non-null  object
 5   report            76922 non-null  object
 6   processed_report  76922 non-null  object
 7   mixed             76922 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 4.7+ MB


In [31]:
new_df.head(15)

Unnamed: 0,drug,dosage,delivery,weight,gender,report,processed_report,mixed
0,mdma,1.5 tablets,oral,185,male,My friend had some experience with X and had t...,"[[my, friend, had, some, experience, with, x, ...",0
1,cannabis,unknown,smoked,152,not specified,This was the first experience that either my f...,"[[this, wa, the, first, experience, that, eith...",0
2,mdma,160 mg,oral,150,male,Preparation: I have heard some conflicting opi...,"[[preparation, , i, have, heard, some, conflic...",1
3,mdma,100 mg,oral,150,male,Preparation: I have heard some conflicting opi...,"[[preparation, , i, have, heard, some, conflic...",1
4,mdma,50 mg,insufflated,150,male,Preparation: I have heard some conflicting opi...,"[[preparation, , i, have, heard, some, conflic...",1
5,25c-nbome,1 hit,sublingual,165,male,00:00 - Just dropped tab alone. Holding under ...,"[[0000, , just, dropped, tab, alone, ], [holdi...",1
6,alcohol - beer/wine,2 glasses,oral,165,male,00:00 - Just dropped tab alone. Holding under ...,"[[0000, , just, dropped, tab, alone, ], [holdi...",1
7,dmt,100 mg,smoked,170,male,Setting: my apartmentDate: 4/24/2013I lit up t...,"[[setting, , my, apartmentdate, , 4242013i, li...",0
8,4-aco-dmt,30 mg,IM,152,male,Been feeling a lack of motivation of sorts. La...,"[[been, feeling, a, lack, of, motivation, of, ...",0
9,wormwood,2 tsp,oral,225,male,About three days ago I had purchased a 200 gra...,"[[about, three, day, ago, i, had, purchased, a...",1


Next! We want to reduce the number of unique drug types. It turns out that many of these drugs have very similar effects or could be classed together. So we will manually create a list for each category, and then create a function that does the sorting for us.

In [34]:
new_df['drug'] = new_df['drug'].str.strip()

def clean_and_categorize(drug):
    # Define the substances in each category
    mushrooms = ['mushrooms', 'mushrooms - p. tampanensis', 'mushrooms - p. cubensis', 'mushrooms - p. atlantis', 'psilocin',
                 'amanitas - a. muscaria', 'mushrooms - p. semilanceata', 'amanitas', 'mushrooms - p. cyanescens',
                 'mushrooms - p. mexicana']
    entactogens = ['mda', 'methylone', '4-methylmethcathinone', '6-apb', '6-apdb', '6-mapb', '5-apb', '3-methylmethcathinone',
                   '3mmc', '3-mmc', '2-aminoindan', '4-fluoroamphetamine', 'mdpv', '4-fluoromethamphetamine', 'mdai',
                   'bk-mbdb', '5-mapb', 'bk-mdea', '2-fluoroamphetamine', 'bad/suspect ecstasy']
    dissociatives = ['methoxetamine', 'methoxphenidine', 'ephenidine', '3f-pcp', 'nitrous oxide', 'diphenidine', 'inhalants',
                     'inhalants - nitrites', 'dxm']
    nootropics = ['piracetam', 'l-theanine', 'ginkgo biloba', 'dmae',
                  'aniracetam', 'noopept', 'vitamins - choline', 'coluracetam', 'smarts', 'smarts - sulbutiamine',
                  'l-dopa', 'meclofenoxate', 'citicoline', 'coluracetam', 'methylliberine', 'levetiracetam', 'aniracetam',
                  'tryptophan', 'huperzine', 'melatonin', 'pramiracetam', 'alpha-gpc', 'tyrosine']
    synthetic_cannabinoids = ['products - spice-like smoking blends', 'jwh-018', 'am-2201',
                              'thj-018', 'synthetic cannabinoids', 'cannabidiol', 'ab-chminaca',
                              'ab-fubinaca', 'mdmb-chmica', 'ur-144', 'xlr-11', 'jwh-250']
    entheogens = ['morning glory', 'syrian rue', 'h.b. woodrose', 'banisteriopsis caapi',
                        'tabernanthe iboga', 'voacanga africana', 'harmala alkaloids', 'epimedium spp.',
                        'boophone disticha', 'heimia myrtifolia', 'acacia maidenii', 'sida cordifolia',
                        'ilex guayusa', 'acacia phlebophylla', 'changa', 'leonotis leonurus', 'leonotis nepetifolia',
                        'heimia salicifolia']
    depressants = ['products - spice-like smoking blends', 'ghb', 'opioids', 'benzodiazepines', 'oxygen',
                         'barbiturates (butalbital)', 'ghv', 'gbl', 'ether', 'butalbital']
    stimulants = ['tobacco', 'tobacco - cigarettes', 'cocaine', 'modafinil', 'methamphetamine', '4-methylethcathinone', 'pentedrone', 'methiopropamine', 'nicotine',
                  '2-fluoromethamphetamine', 'ethylphenidate', 'alpha-pvp', 'amphetamines', 'ethylcathinone', '5-it', 'crack',
                  'ephedrine', 'lisdexamfetamine', 'bzp', '3f-phenmetrazine', 'coca', 'adrafinil', 'armodafinil', 'methcathinone' ]
    hallucinogens = ['nbome series', '25c-nbome', '25i-nbome', 'doc', 'dpt', '4-aco-dipt', '4-ho-det', 'brugmansia', '4-ho-dipt',
                     'amt', 'products - bath salts', '4-ho-mipt', '25b-nbome', '4-ho-met', 'doi', 'mipt', 'al-lad',
                     '4-aco-dalt', 'datura', 'dipt', '4-aco-det', 'dob', 'dom', 'belladonna', 'bromo-dragonfly', 'toad venom',
                     'tma-2', '4-aco-mipt', 'ald-52', 'llylescaline', '4-aco-met', 'lsz']
    anxiolytics = ['etizolam', 'benzodiazepines', 'clonazolam', 'bromazepam', 'diclazepam', 'fluclotizolam', 'meopp']
    antidepressants = ['"st. johns wort"', 'duloxetine', 'fluclotizolam', 'maois', 'moclobemide']
    pharmaceuticals = ['dexamethasone', 'varenicline', 'tripelennamine', 'triprolidine', 'clozapine', 'clobazam',
                       'pseudoephedrine', 'diphenhydramine', 'loperamide', 'chloral hydrate', 'products - other',
                       'dimenhydrinate', 'aspirin', 'tapentadol', 'chlorpheniramine maleate', 'acetaminophen',
                       'pharmaceuticals', 'ergoloid mesylates', 'scopolamine', 'hydroxyzine', 'naloxone']
    opioids = ['opium', 'heroin', 'codeine', 'hydrocodone', 'methadone', 'kratom', 'ah-7921', 'opiates',
               'poppies - opium', 'hydromorphone', 'oxycodone', 'morphine', 'poppies', 'poppies - california']
    DMT = ['mimosa tenuiflora', 'anadenanthera', 'psychotria viridis', 'acacia confusa', 'acacia', 'det']
    mescaline = ['cacti - t. pachanoi', 'cacti - t. peruvianus', 'cacti - columnar', 'peyote', 'cacti - t. bridgesii']
    ibogaine = ['ibogaine']
    oneirogen = ['silene capensis', 'calea zacatechichi', 'entada rheedii', 'mugwort']
    phencyclidine = ['pcp', 'pce', 'dizocilpine']
    botanicals = ['kava', 'wormwood', 'nutmeg', 'betel nut', 'tea', 'caffeine', 'coffee', 'yohimbe', 'tagetes lucida',
               'chamomile', 'mullein', 'damiana', 'lactuca - l. virosa', 'harmaline', 'sceletium tortuosum',
               'ginger', 'valerian', 'catnip', 'skullcap', 'passion flower', 'capsaicin', 'herbal ecstasy', 'theanine',
               'lotus/lily', 'yerba mate', 'coleus', 'calamus', 'ginseng', 'lotus/lily - nymphaea nouchali var caerulea',
               'hops', 'cacao', 'ephedra sinica']

    if 'salvia' in drug: return 'salvia'
    elif '3-meo' in drug: return '3-MeO'
    elif '5-meo' in drug: return '5-meo'
    elif '2c' in drug: return '2c'
    elif 'cannabis' in drug: return 'cannabis'
    elif 'ayahuasca' in drug: return 'ayahuasca'
    elif 'alcohol' in drug: return 'alcohol'
    elif 'absinthe' in drug: return 'alcohol'
    elif 'ketamine' in drug: return 'ketamine'
    elif 'lsd' in drug: return 'lsd'
    elif 'mdma' in drug: return 'mdma'
    elif any(x in drug for x in DMT) or 'dmt' in drug: return 'DMT'
    elif any(x in drug for x in mescaline) or 'mescaline' in drug: return 'mescaline'
    elif any(x in drug for x in ibogaine) or 'ibogaine' in drug: return 'ibogaine'
    elif any(x in drug for x in oneirogen): return 'oneirogen'
    elif any(x in drug for x in phencyclidine): return 'phencyclidine'
    elif any(x in drug for x in nootropics) or 'vitamin' in drug: return 'nootropic'
    elif any(x in drug for x in pharmaceuticals) or 'pharms' in drug: return 'pharmaceutical'
    elif any(x in drug for x in mushrooms) or 'mushrooms' in drug: return 'mushrooms'
    elif drug in synthetic_cannabinoids: return 'synthetic cannabinoid'
    elif drug in entactogens: return 'entactogen'
    elif drug in dissociatives: return 'dissociative'
    elif drug in opioids: return 'opioid'
    elif drug in entheogens: return 'entheogen'
    elif drug in depressants: return 'depressant'
    elif drug in hallucinogens: return 'hallucinogen'
    elif drug in stimulants: return 'stimulant'
    elif drug in anxiolytics: return 'anxiolytic'
    elif drug in antidepressants: return 'antidepressant'
    elif drug in botanicals: return 'botanical'
    elif drug == '[]': return 'unknown'
    elif drug == 'various': return 'unknown'
    elif drug == 'nan': return 'unknown'
    elif drug == 'unknown': return 'unknown'
    elif drug == '1': return 'unknown'
    else: return 'other'

# Create a new column 'drug_category' by applying the 'clean_and_categorize' function to the 'drug' column
new_df['drug_category'] = new_df['drug'].apply(clean_and_categorize)
len(new_df['drug_category'].unique())


31
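
Hand-written lists like these are easy to get subtly wrong: a name can land in two categories (only the first matching branch wins), an uppercase entry can never match a lowercased drug name, and a missing comma silently fuses two strings. A sketch of some quick sanity checks, using hypothetical helpers find_overlaps and find_uppercase on a small excerpt:

```python
def find_overlaps(categories):
    """Map each name that appears in more than one category to those categories."""
    seen = {}
    for category, names in categories.items():
        for name in names:
            seen.setdefault(name, []).append(category)
    return {name: cats for name, cats in seen.items() if len(cats) > 1}


def find_uppercase(categories):
    """Names containing uppercase letters can never equal a lowercased drug."""
    return sorted(n for names in categories.values() for n in names if n != n.lower())


# Small excerpt in the style of the lists above
example = {
    "depressant": ["ghb", "benzodiazepines"],
    "anxiolytic": ["etizolam", "benzodiazepines"],
    "dissociative": ["dxm", "3F-PCP"],
}

print(find_overlaps(example))
print(find_uppercase(example))

# Adjacent string literals are joined silently when a comma is missing
fused = ["dmae" "aniracetam"]
print(fused)
```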

31 different drug types is still a bit too high. Let's check the counts for each of those categories:

In [47]:
new_df['drug_category'].value_counts()

drug_category
pharmaceutical           10105
cannabis                  9606
stimulant                 5784
mushrooms                 4072
botanical                 3810
opioid                    3796
mdma                      3579
alcohol                   3421
hallucinogen              3120
lsd                       3096
other                     2849
salvia                    2844
2c                        2818
dissociative              2640
entheogen                 2550
DMT                       2364
entactogen                1754
nootropic                 1659
5-meo                     1292
ketamine                  1251
unknown                    921
mescaline                  730
depressant                 685
anxiolytic                 457
synthetic cannabinoid      451
oneirogen                  308
ayahuasca                  232
phencyclidine              219
antidepressant             167
3-MeO                      137
ibogaine                    84
Name: count, dtype: int64

Now that we've got it narrowed down to 31, we can narrow it down even further to just 10 categories:

In [49]:
# Define the mapping from old categories to new ones
category_mapping = {
    'pharmaceutical': 'Pharmaceutical',
    'cannabis': 'Cannabinoid',
    'stimulant': 'Stimulant',
    'mushrooms': 'Psychedelic',
    'botanical': 'Other',
    'opioid': 'Opioid',
    'mdma': 'Entactogen/Empathogen',
    'alcohol': 'Depressant',
    'hallucinogen': 'Psychedelic',
    'lsd': 'Psychedelic',
    'salvia': 'Psychedelic',
    '2c': 'Psychedelic',
    'other': 'Other',
    'dissociative': 'Dissociative',
    'entheogen': 'Entheogen',
    'DMT': 'Psychedelic',
    'entactogen': 'Entactogen/Empathogen',
    'nootropic': 'Other',
    '5-meo': 'Psychedelic',
    'ketamine': 'Dissociative',
    'unknown': 'Other',
    'mescaline': 'Psychedelic',
    'depressant': 'Depressant',
    'synthetic cannabinoid': 'Cannabinoid',
    'anxiolytic': 'Pharmaceutical',
    'oneirogen': 'Other',
    'ayahuasca': 'Psychedelic',
    'phencyclidine': 'Dissociative',
    'antidepressant': 'Pharmaceutical',
    '3-MeO': 'Other',
    'ibogaine': 'Entheogen'
}

# Apply the mapping to the 'drug_category' column
new_df['drug_category'] = new_df['drug_category'].map(category_mapping)

new_df['drug_category'].value_counts()

drug_category
Psychedelic              20568
Pharmaceutical           10729
Cannabinoid              10057
Other                     9684
Stimulant                 5784
Entactogen/Empathogen     5333
Dissociative              4110
Depressant                4106
Opioid                    3796
Entheogen                 2634
Name: count, dtype: int64
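
Series.map returns NaN for any value missing from the dictionary, so it is worth checking for unmapped categories before overwriting the column. A sketch on made-up data (the mapping here is deliberately incomplete):

```python
import pandas as pd

fine = pd.Series(["lsd", "cannabis", "kava", "lsd"])
partial_mapping = {"lsd": "Psychedelic", "cannabis": "Cannabinoid"}

mapped = fine.map(partial_mapping)

# Any original value whose mapped result is missing has no dictionary entry
unmapped = sorted(fine[mapped.isna()].unique())
print(unmapped)
```

An empty list means every category was covered by the mapping.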

10 drug types is far more manageable. It also has the added benefit of increasing the size of our largest and most important category, 'Psychedelic'. We definitely lose some nuance between particular psychedelic types, but that is a problem for another day. In fact, we need to reduce our data set even more. We will keep all the data for the 'Psychedelic', 'Entactogen/Empathogen', and 'Entheogen' categories, but we will reduce every other category by 50%.

In [66]:
# Categories to keep all data
full_categories = ['Psychedelic', 'Entactogen/Empathogen', 'Entheogen']

# Create an empty DataFrame to hold the final result
final_df = pd.DataFrame()

# Keep all the data in the full_categories
for cat in full_categories:
    final_df = pd.concat([final_df, new_df[new_df['drug_category'] == cat]])

# For other categories, reduce by 50%
other_categories = set(new_df['drug_category'].unique()) - set(full_categories)
for cat in other_categories:
    cat_df = new_df[new_df['drug_category'] == cat]

    # Sample 50% of the data in this category without replacement
    reduced_df = cat_df.sample(frac=0.5, replace=False, random_state=42)

    final_df = pd.concat([final_df, reduced_df])

# Reset the index of the final DataFrame
final_df.reset_index(drop=True, inplace=True)

final_df.value_counts('drug_category')

drug_category
Psychedelic              20568
Pharmaceutical            5364
Entactogen/Empathogen     5333
Cannabinoid               5028
Other                     4842
Stimulant                 2892
Entheogen                 2634
Dissociative              2055
Depressant                2053
Opioid                    1898
Name: count, dtype: int64
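
The same downsampling can be written as a single pass over groupby, which avoids filtering the frame once per category. A sketch on a toy frame, with a hypothetical helper downsample_except; fixing random_state keeps the sample repeatable:

```python
import pandas as pd


def downsample_except(frame, column, keep_all, frac=0.5, seed=42):
    """Keep every row of the categories in keep_all; sample a fraction of the rest."""
    parts = []
    for category, group in frame.groupby(column):
        if category in keep_all:
            parts.append(group)
        else:
            parts.append(group.sample(frac=frac, replace=False, random_state=seed))
    return pd.concat(parts, ignore_index=True)


toy = pd.DataFrame({
    "drug_category": ["Psychedelic"] * 4 + ["Opioid"] * 6,
    "report": [f"report {i}" for i in range(10)],
})

reduced = downsample_except(toy, "drug_category", keep_all={"Psychedelic"})
print(reduced["drug_category"].value_counts())
```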

But we still have the problem of the mixed drugs:

In [67]:
final_df['mixed'].value_counts()

mixed
1    37459
0    15208
Name: count, dtype: int64

We will respond to this problem by duplicating the unmixed rows (where 'mixed' is 0), so that reports about a single drug type are given more weight in our predictions and our chat generator.

In [68]:
# Identify not mixed reports
not_mixed_df = final_df[final_df['mixed'] == 0]

# Append not mixed reports to the final dataframe once (to double them)
final_df = pd.concat([final_df, not_mixed_df])

# Shuffle the data so it's not all the same reports in a row
final_df = final_df.sample(frac=1).reset_index(drop=True)

final_df['mixed'].value_counts()

mixed
1    37459
0    30416
Name: count, dtype: int64
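
Duplicating rows is one way to emphasize single-drug reports; another option, if the downstream model accepts per-row weights, is to leave the rows alone and store the emphasis in a column. A sketch assuming a hypothetical 'sample_weight' column, on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"mixed": [0, 1, 1, 0, 1]})

# Unmixed rows count double, mixed rows count once
toy["sample_weight"] = toy["mixed"].map({0: 2, 1: 1})

print(toy.groupby("mixed")["sample_weight"].sum())
```

This keeps the row count unchanged, which also avoids identical reports appearing more than once in the data.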

This has the added benefit of bringing the mixed and unmixed groups closer to balance and giving more weight to our 'pure' drug reports.

In [69]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67875 entries, 0 to 67874
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   drug              67875 non-null  object
 1   dosage            67875 non-null  object
 2   delivery          67875 non-null  object
 3   weight            67875 non-null  int64 
 4   gender            67875 non-null  object
 5   report            67875 non-null  object
 6   processed_report  67875 non-null  object
 7   mixed             67875 non-null  int64 
 8   drug_category     67875 non-null  object
dtypes: int64(2), object(7)
memory usage: 4.7+ MB


In [70]:
final_df.head()

Unnamed: 0,drug,dosage,delivery,weight,gender,report,processed_report,mixed,drug_category
0,mdpv,,insufflated,140,male,"MDPV, Over a Two-Month PeriodI thought I'd sha...","[[mdpv, , over, a, twomonth, periodi, thought,...",1,Entactogen/Empathogen
1,crack,repeated,smoked,165,male,I'll start this story with a bit of background...,"[[i, ll, start, this, story, with, a, bit, of,...",0,Stimulant
2,methcathinone,10 mg,IV,135,male,I hate the name of this drug. Impossible for m...,"[[i, hate, the, name, of, this, drug, ], [impo...",1,Stimulant
3,jwh-018,,smoked,209,male,"Me - Male (95kg, 29 y/o, moderate experience o...","[[me, , male, , 95kg, , 29, yo, , moderate, ex...",1,Cannabinoid
4,pharms - pregabalin,,,170,male,GBL AddictI was thinking of writing this for a...,"[[gbl, addicti, wa, thinking, of, writing, thi...",1,Pharmaceutical


It looks like when we exploded the rows, some of the dosage and delivery values were left empty. Let's fill them before saving our cleaned and processed dataset.

In [71]:
# Replace 'null' values in 'dosage' and 'delivery' column with 'unknown'
final_df['dosage'] = final_df['dosage'].fillna('unknown')
final_df['delivery'] = final_df['delivery'].fillna('unknown')
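
fillna only touches true missing values, so entries that are empty or whitespace-only strings would pass through unchanged. If that turns out to be the case here, a variant like the following (shown on made-up values) treats them as missing too:

```python
import pandas as pd

dosage = pd.Series(["10 mg", "", "  ", None])

# Normalize missing values to '', trim whitespace, then label every blank
cleaned = dosage.fillna("").str.strip().replace("", "unknown")

print(cleaned.tolist())
```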

In [10]:
final_df.to_csv('D:/Cloud/Google Drive/Colab Notebooks/Data/processed.csv', index=False)