In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Predicting THC Content in Cannabis Distillates Using CBN Values: Addressing the THC Inflation Problem

## Introduction:

The burgeoning cannabis industry has witnessed exponential growth in recent years, fueled by the increasing legalization of cannabis for medicinal and recreational use across the globe. However, this rapid expansion has brought to light a concerning issue known as "THC inflation." THC, or delta-9-tetrahydrocannabinol, is the primary psychoactive compound found in cannabis, and its potency plays a crucial role in product quality and consumer safety.

THC inflation refers to the phenomenon where the reported THC content in cannabis products, particularly distillates, significantly exceeds the actual levels. This discrepancy has raised concerns among both the public and Licensed Producers (LPs), prompting efforts to address this pressing issue. Inflated THC levels can mislead consumers, compromise their experiences, and impact their overall perception of cannabis products. Additionally, it can undermine regulatory compliance and consumer trust in the industry.

Efforts to mitigate THC inflation involve a multifaceted approach, encompassing collaboration between LPs, consumers, and regulatory bodies. Among the various strategies proposed, a promising avenue is the use of advanced data analysis techniques and machine learning models to predict THC content accurately. Such models leverage data from potency testing, offering an alternative means of verification while reducing reliance on reported THC values.

This project takes one such approach: it uses cannabinol (CBN) values to predict THC levels in cannabis distillates, with a reported accuracy of 92%. The dataset comprises in-house potency results for approximately 500 distillate samples generated by BZAM Management Inc. and forms the basis of the predictive model.

The project commenced with a comprehensive Exploratory Data Analysis (EDA) phase, revealing a significant negative correlation (-0.83) between CBN and delta-9-THC (d9-THC) for distillates. This correlation suggests that as CBN levels rise, d9-THC levels tend to decrease, reflecting the natural degradation of THC over time due to exposure to high temperature, oxygen and light. Understanding this relationship is pivotal in the development of an accurate predictive model.

To build the predictive model, the dataset was divided into training and test sets, with 70% used for training and 30% for testing. The model was optimized using polynomial features, ensuring it captures the complex interplay between CBN and THC content in cannabis distillates.

Addressing THC inflation matters for the integrity and sustainability of the cannabis market. This project applies machine learning to predict THC values from CBN data, offering a possible way to flag inflated results and improve transparency in the industry. The sections that follow cover the exploratory analysis, the models, and their evaluation.

In [None]:
# Dataset from BZAM Management Inc. In-house potency testing data
df1 = pd.read_csv('../input/potency/potency.csv')
df2 = pd.read_csv('../input/potency2/potency2.csv')

# EDA

In [None]:
df2.tail()

In [None]:
df1.drop(['Source Batch/Lot','Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20'], axis=1, inplace=True)

In [None]:
df2.drop(['Source Batch/Lot'], axis=1, inplace=True)

In [None]:
# Making sure the two dataframes have similar columns
df1.columns


In [None]:
df2.columns

In [None]:
df1 = df1.loc[df1['Date of Analysis'] <= '2023-05-19', :]

In [None]:
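# Parse the day-first date strings before comparing; compared as text they would sort by day, not by date
df2 = df2.loc[pd.to_datetime(df2['Date of Analysis'], dayfirst=True) <= pd.Timestamp('2022-12-21'), :]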

In [None]:
print(df1.shape)
print(df2.shape)

In [None]:
df_concat = pd.concat([df1, df2]).drop_duplicates()

In [None]:
df_concat.tail()

In [None]:
df_concat.shape

In [None]:
df_concat = df_concat.fillna(0)

In [None]:
df_concat.head(2)

In [None]:
df_concat.dtypes

In [None]:
# Changing the data types of numbers from string to numeric(float)

df_concat['THC Wt.%'] = pd.to_numeric(df_concat['THC Wt.%'], errors='coerce')
df_concat[' ∆9-THC (mg/g)'] = pd.to_numeric(df_concat[' ∆9-THC (mg/g)'], errors='coerce')
df_concat['THCA (mg/g)'] = pd.to_numeric(df_concat['THCA (mg/g)'], errors='coerce')
df_concat['Total THC (mg/g)'] = pd.to_numeric(df_concat['Total THC (mg/g)'], errors='coerce')

In [None]:
#Changing Date from string to date format
df_concat['Date of Analysis'] = pd.to_datetime(df_concat['Date of Analysis'], dayfirst=True)

In [None]:
df_concat.dtypes

In [None]:
df = df_concat

In [None]:
df.reset_index(drop=True, inplace=True)


In [None]:
df.set_index('Date of Analysis', inplace=True)

In [None]:
df.sort_values(by='Date of Analysis', inplace = True)

In [None]:
df.head()

In [None]:
df.shape

In [None]:
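# Restrict to numeric columns; text columns such as 'Sample ID' cannot be correlated
df.corr(numeric_only=True)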

In [None]:
plt.hist(df[" ∆9-THC (mg/g)"], bins = 3)

# set x/y labels and plot title
plt.xlabel(" ∆9-THC (mg/g)")
plt.ylabel("count")
plt.title("∆9-THC (mg/g) bins")

As shown above, samples that test around 800 mg/g are very common in our data. <br>
N.B.: The df dataframe mixes several sample fractions: TDT (distillates), THO (heavy oils), TTE (terpenes), TDC (decarboxylated oil), and THT (THC and heavy oil). As a first step, this project explores correlations between cannabinoids within and across these fractions.
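
As a minimal sketch of how the fractions could be tallied from the `Sample ID` prefix in one pass, the snippet below uses made-up IDs (the real IDs come from the potency CSVs) and assumes, as the separation step in this notebook does, that an ID such as `RND-DT-…` belongs to the same fraction as `RND-TDT-…`.

```python
import pandas as pd

# Hypothetical Sample IDs for illustration only.
samples = pd.DataFrame({
    "Sample ID": ["RND-TDT-001", "RND-DT-002", "RND-THO-003", "RND-HO-004", "RND-TTE-005"],
})

# Capture the two-letter fraction code after "RND-", skipping an optional leading "T",
# so that "TDT" and "DT" map to the same fraction.
code = samples["Sample ID"].str.extract(r"^RND-T?(DT|DC|HO|HT|TE)", expand=False)
samples["Fraction"] = "T" + code

# Number of samples per fraction
counts = samples["Fraction"].value_counts().to_dict()
print(counts)
```

A single `Fraction` column like this also makes per-fraction summaries available through `groupby`, instead of keeping one dataframe per fraction.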

In [None]:
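# Restrict to numeric columns; text columns such as 'Sample ID' cannot be correlated
df.corr(numeric_only=True)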

### Separation Into Different Sample Types/Fractions (TDT, THO, TTE, TDC, THT)

In [None]:
df_DT =df.loc[df['Sample ID'].str.startswith('RND-DT')].copy()
df_HO =df.loc[df['Sample ID'].str.startswith('RND-HO')].copy()
df_TE =df.loc[df['Sample ID'].str.startswith('RND-TE')].copy()
df_DC =df.loc[df['Sample ID'].str.startswith('RND-DC')].copy()
df_HT =df.loc[df['Sample ID'].str.startswith('RND-HT')].copy()
df_TDT = df.loc[df['Sample ID'].str.startswith('RND-TDT')].copy()
df_TDC = df.loc[df['Sample ID'].str.startswith('RND-TDC')].copy()
df_TTE =df.loc[df['Sample ID'].str.startswith('RND-TTE')].copy()
df_THT = df.loc[df['Sample ID'].str.startswith('RND-THT')].copy()
df_THO = df.loc[df['Sample ID'].str.startswith('RND-THO')].copy()

In [None]:
df_TDTT = pd.concat([df_DT, df_TDT]).drop_duplicates()
df_TDCC = pd.concat([df_DC, df_TDC]).drop_duplicates()
df_TTET = pd.concat([df_TE, df_TTE]).drop_duplicates()
df_THOT = pd.concat([df_HO, df_THO]).drop_duplicates()
df_THTT = pd.concat([df_HT, df_THT]).drop_duplicates()
df_TDT = df_TDTT
df_TDC = df_TDCC
df_TTE = df_TTET
df_THO = df_THOT
df_THT = df_THTT


In [None]:
print("TDT:", df_TDT.shape)
print("TDC:", df_TDC.shape)
print("THO:", df_THO.shape)
print("THT:", df_THT.shape)
print("TTE:", df_TTE.shape)

In [None]:
df_TDC.describe()

In [None]:
df_TDT.describe()

In [None]:
df_THT.describe()

In [None]:
df_THO.describe()

In [None]:
df_TTE.describe()

In [None]:
#TTE_THT_Ratio = df_TTE[[' ∆9-THC (mg/g)','CBGmg/g','CBNmg/g']]/df_THT[[' ∆9-THC (mg/g)','CBGmg/g','CBNmg/g']]

In [None]:
#TTE_THT_Ratio.replace(np.nan, 0 , inplace=True)

In [None]:
#TTE_THT_Ratio.shape

In [None]:
#THO_TDT_Ratio = df_THO[[' ∆9-THC (mg/g)','CBGmg/g', 'CBNmg/g','CBCmg/g']]/df_TDT[[' ∆9-THC (mg/g)','CBGmg/g','CBNmg/g','CBCmg/g']]

In [None]:
#THO_TDT_Ratio.replace(np.nan, 0 , inplace=True)

In [None]:
#THO_TDT_Ratio.shape

In [None]:
#TTE_THT_Ratio = TTE_THT_Ratio[(TTE_THT_Ratio[[' ∆9-THC (mg/g)','CBGmg/g', 'CBNmg/g']] !=0).all(axis=1)]


In [None]:
#TTE_THT_Ratio.shape

### Remove rows of the selected columns that have 0

In [None]:
#THO_TDT_Ratio = THO_TDT_Ratio[(THO_TDT_Ratio[[' ∆9-THC (mg/g)','CBGmg/g', 'CBNmg/g']] !=0).all(axis=1)]


In [None]:
#THO_TDT_Ratio.shape

In [None]:
#TTE_THT_Ratio.head(10)

In [None]:
#THO_TDT_Ratio.corr()

In [None]:
#TTE_THT_Ratio.corr()

### Heatmap Plot of Distillate Samples showing d9-THC, CBN, and CBG
Visualize any relationship between these three cannabinoids in distillates.

In [None]:
df_columns = df[[' ∆9-THC (mg/g)', 'CBGmg/g','CBNmg/g']]

In [None]:
# Build the heatmap frame in one step to avoid modifying a slice of df_TDT in place
df_TDT_Heat = df_TDT[['Sample ID',' ∆9-THC (mg/g)', 'CBGmg/g','CBNmg/g']].set_index('Sample ID')
df_TDT_Heat.head()

In [None]:
# Heatmap of the three cannabinoids (columns) for each distillate sample (rows)
plt.figure(figsize=(10,15))
plt.pcolor(df_TDT_Heat)
plt.yticks(np.arange(0.5, len(df_TDT_Heat.index), 1), df_TDT_Heat.index)
plt.xticks(np.arange(0.5, len(df_TDT_Heat.columns), 1), df_TDT_Heat.columns)
plt.show()

In [None]:
sns.heatmap(df_TDT_Heat)

### Sub-plots Visualizing the relationships between THC vs CBG, CBN and CBC for TDT and TDC Samples

In [None]:
fig = plt.figure()

ax0 = fig.add_subplot(2,3,1)
ax1 = fig.add_subplot(2,3,2)
ax2 = fig.add_subplot(2,3,3)
ax3 = fig.add_subplot(2,3,4)
ax4 = fig.add_subplot(2,3,5)
ax5 = fig.add_subplot(2,3,6)

#subplot 1 
df_TDT.plot(kind='scatter', x='CBGmg/g', y=' ∆9-THC (mg/g)', figsize=(20,10), ax=ax0)
ax0.set_title('d9-THC Vs CBG')
ax0.set_xlabel('CBGmg/g')
ax0.set_ylabel(' ∆9-THC (mg/g)')

#subplot 2
df_TDT.plot(kind='scatter',x='CBNmg/g', y=' ∆9-THC (mg/g)', figsize=(20,10), ax=ax1)
ax1.set_title('d9-THC Vs CBN')
ax1.set_xlabel('CBNmg/g')
ax1.set_ylabel(' ∆9-THC (mg/g)')

#subplot 3
df_TDT.plot(kind='scatter', x='CBCmg/g', y=' ∆9-THC (mg/g)', figsize=(20,10), ax=ax2)
ax2.set_title('d9-THC Vs CBC')
ax2.set_xlabel('CBCmg/g')
ax2.set_ylabel(' ∆9-THC (mg/g)')

#subplot 4
df_TDC.plot(kind='scatter', x='CBGmg/g', y=' ∆9-THC (mg/g)', figsize=(20,10), ax=ax3)
ax3.set_title('d9-THC Vs CBG')
ax3.set_xlabel('CBGmg/g')
ax3.set_ylabel(' ∆9-THC (mg/g)')

#subplot 5
df_TDC.plot(kind='scatter',x='CBNmg/g', y=' ∆9-THC (mg/g)', figsize=(20,10), ax=ax4)
ax4.set_title('d9-THC Vs CBN')
ax4.set_xlabel('CBNmg/g')
ax4.set_ylabel(' ∆9-THC (mg/g)')

#subplot 6
df_TDC.plot(kind='scatter', x='CBCmg/g', y=' ∆9-THC (mg/g)', figsize=(20,10), ax=ax5)
ax5.set_title('d9-THC Vs CBC')
ax5.set_xlabel('CBCmg/g')
ax5.set_ylabel(' ∆9-THC (mg/g)')

plt.suptitle('TDT samples (Top) Vs TDC samples (bottom)')
plt.show()

### Normalization
To check whether cannabinoid ratios show any relationship across fractions.
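
One caveat with ratio features: if a missing THC result has been filled with 0, dividing by it yields `inf`. The sketch below, on made-up values, shows one possible way to guard against that by treating a zero denominator as missing. It is an illustration under that assumption, not a description of what the cells in this notebook do.

```python
import numpy as np
import pandas as pd

# Hypothetical values in mg/g; the THC of 0 stands in for a missing result filled with 0.
demo = pd.DataFrame({
    "CBNmg/g": [10.0, 20.0, 5.0],
    " ∆9-THC (mg/g)": [800.0, 0.0, 500.0],
})

# Treat a zero denominator as missing so the ratio becomes NaN instead of inf.
thc = demo[" ∆9-THC (mg/g)"].replace(0, np.nan)
demo["CBN_THC_Ratio"] = demo["CBNmg/g"] / thc
print(demo)
```

Rows with a NaN ratio can then be dropped or inspected separately before computing correlations.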

In [None]:
df['CBG_THC_Ratio'] = df['CBGmg/g']/df[' ∆9-THC (mg/g)']
df['CBN_THC_Ratio'] = df['CBNmg/g']/df[' ∆9-THC (mg/g)']
df['CBC_THC_Ratio'] = df['CBCmg/g']/df[' ∆9-THC (mg/g)']
df['THC_Total_Ratio'] = df[' ∆9-THC (mg/g)']/df['Total']

In [None]:
df.head()

In [None]:
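# Restrict to numeric columns; text columns such as 'Sample ID' cannot be correlated
df.corr(numeric_only=True)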

### Examining the relationships between cannabinoids in each fraction

In [None]:
df_TDT[[' ∆9-THC (mg/g)', 'CBGmg/g','CBNmg/g','CBCmg/g','Total']].corr()

In [None]:
df_TDC[[' ∆9-THC (mg/g)', 'CBGmg/g','CBNmg/g','CBCmg/g','Total']].corr()

In [None]:
df_THO[[' ∆9-THC (mg/g)', 'CBGmg/g','CBNmg/g','CBCmg/g','Total']].corr()

In [None]:
df_TTE[[' ∆9-THC (mg/g)', 'CBGmg/g','CBNmg/g','CBCmg/g','Total']].corr()

In [None]:
df_THT[[' ∆9-THC (mg/g)', 'CBGmg/g','CBNmg/g','CBCmg/g','Total']].corr()

### Potential Candidates

Distillate df_TDT: CBN, and CBC vs THC<br>
Heavy oil df_THO: CBN, CBG, CBC vs THC<br>
Terpene df_TTE: CBN, CBG, CBC vs THC<br>
Heavy+Terpenes df_THT: CBG, CBC vs THC<br>

Next, we visualize the cannabinoid pairs with correlation coefficients above 0.60 to check for an obvious relationship. The regplot and residplot calls below show the relationship between the cannabinoids in each fraction.

In [None]:
sns.regplot(x='CBGmg/g', y=' ∆9-THC (mg/g)', data=df_TTE)

In [None]:
# Residuals of the same CBG -> THC fit shown in the regplot above
sns.residplot(x='CBGmg/g', y=' ∆9-THC (mg/g)', data=df_TTE)
plt.show()

The function below is used to inspect the regression fit (linear or polynomial) between candidate cannabinoid pairs in each fraction. The pairs were selected based on the correlation coefficients shown in the dataframes above for each oil fraction.
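
Alongside the visual check, the choice between a linear and a polynomial fit can be compared numerically. The sketch below uses synthetic data and a hypothetical helper, `r_squared`, to compute R² for two polynomial degrees with `np.polyfit`; neither the data nor the helper comes from the potency dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
# Synthetic, deliberately curved data with a little noise
y = 2.0 + 0.5 * x**2 + rng.normal(scale=0.5, size=x.size)

def r_squared(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    ss_res = np.sum(residuals**2)
    ss_tot = np.sum((y - y.mean())**2)
    return 1.0 - ss_res / ss_tot

r2_linear = r_squared(x, y, 1)
r2_quadratic = r_squared(x, y, 2)
print("degree 1:", round(r2_linear, 4))
print("degree 2:", round(r2_quadratic, 4))
```

In-sample R² never decreases as the degree grows, so a comparison like this is best paired with a residual plot or a held-out test set before settling on a degree.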

In [None]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
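    # Evaluate the fitted polynomial across the observed range of the independent variable
    x_new = np.linspace(np.min(independent_variable), np.max(independent_variable), 100)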
    y_new = model(x_new)

    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title('Polynomial Fit with Matplotlib for variables')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    fig = plt.gcf()
    plt.xlabel(Name)
    plt.ylabel('THC')

    plt.show()
    plt.close()

In [None]:
x = df_TDT['CBGmg/g']
y = df_TDT[' ∆9-THC (mg/g)']

In [None]:
f = np.polyfit(x, y, 3)
p = np.poly1d(f)
print(p)

In [None]:
PlotPolly(p, x, y, 'CBGmg/g')

## Deductions from the plots
### Distillate TDT
 CBC vs THC: polynomial<br>
 CBN vs THC: Linear(negative)<br>
 Total vs THC: Linear

### Decarbed oil TDC
 CBC vs THC: polynomial<br>
 Total vs THC: Linear

### Heavy oil THO
 Total vs THC: Linear<br>
 CBG vs THC: Linear<br>
 CBN vs THC: polynomial(5th order)<br>
 CBC vs THC: polynomial

### Terpene fraction TTE
 CBG vs THC: Linear<br>
 CBN vs THC: polynomial (5th order)<br>
 CBC vs THC: polynomial<br>
 Total vs THC: Linear

### Heavy+Terpene fraction THT
 CBG vs THC: Linear<br>
 CBC vs THC: polynomial (4th order)<br>
 Total vs THC: Linear

## CBN vs THC for Distillates
This study focuses on the well-documented relationship between cannabinol (CBN) and delta-9-tetrahydrocannabinol (THC), which is central to addressing THC inflation in the cannabis industry. It also aims to build a predictive model of that relationship that increases confidence in the reported THC values for cannabis distillates. The model's core function is to reduce the impact of inflated THC data, which testing laboratories can generate inadvertently, and so to foster greater transparency and accuracy in the industry.
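
To make the intended use concrete, here is a minimal sketch of how a fitted CBN-to-THC model could be used to flag suspicious reports. The helper `flag_inflated`, the threshold `k`, and all numbers are hypothetical and chosen for illustration. The notebook itself does not define such a rule.

```python
import numpy as np

def flag_inflated(reported_thc, predicted_thc, residual_std, k=2.0):
    """Boolean mask of reports that exceed the prediction by more than k residual SDs."""
    reported = np.asarray(reported_thc, dtype=float)
    predicted = np.asarray(predicted_thc, dtype=float)
    return (reported - predicted) > k * residual_std

# Hypothetical reported and model-predicted THC values (mg/g)
reported = [850.0, 905.0, 790.0]
predicted = [840.0, 845.0, 800.0]

flags = flag_inflated(reported, predicted, residual_std=20.0)
print(flags.tolist())
```

Here only the second sample is flagged, because its reported value sits 60 mg/g above the prediction while the threshold is 2 × 20 = 40 mg/g. In practice `residual_std` would be estimated from the model's residuals on held-out data.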

In [None]:
df_TDT = df_TDT.dropna()
df_TDT = df_TDT.reset_index()

In [None]:
df_TDT.head(3)

In [None]:
df_TDT.shape

In [None]:
X = df_TDT[['CBNmg/g']]
Y = df_TDT[' ∆9-THC (mg/g)']


In [None]:
# Pearson Correlation Coefficient
pearson_coef, p_value = stats.pearsonr(df_TDT['CBNmg/g'], df_TDT[' ∆9-THC (mg/g)'])
print( "The Pearson Correlation Coefficient is", round(pearson_coef,3), " with a P-value of P = ", p_value) 

### Predictive Analysis

In [None]:
lm = LinearRegression()
lm.fit(X,Y)
Yhat=lm.predict(X)
Yhat[0:5] 

A RobustScaler was chosen over the StandardScaler to reduce the effect of outliers on the model.<br>
PolynomialFeatures was also applied so that the linear regression can capture curvature in the CBN-THC relationship. The polynomial degree has to be chosen with care, since too low a degree underfits and too high a degree overfits.
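
The difference between the two scalers can be seen on a toy column with one outlier. With default settings, `RobustScaler` centers on the median and divides by the interquartile range, while `StandardScaler` uses the mean and standard deviation, both of which are pulled by the outlier. The values below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Four ordinary values and one outlier (100)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Median is 3 and the interquartile range is 4 - 2 = 2
robust = RobustScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()

print("robust:  ", np.round(robust, 3))
print("standard:", np.round(standard, 3))
```

With the robust scaling the four ordinary points keep a spread of 1.5 units, whereas the standard scaling squeezes them into a much narrower band because the outlier inflates the standard deviation.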

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PolynomialFeatures

Input = [("scale", RobustScaler()), ('polynomial', PolynomialFeatures(degree=3,include_bias=False)),  ("model", LinearRegression())]

pipe = Pipeline(Input)
pipe

In [None]:
pipe.fit(X, Y)

ypipe=pipe.predict(X)
ypipe[0:5]

In [None]:
print("Slope:", lm.coef_)
print("Intercept:", lm.intercept_)

In [None]:
plt.figure(figsize=(12, 10))


ax1 = sns.kdeplot(Y, color="r")
sns.kdeplot(ypipe, color="g", ax=ax1)


plt.title('Actual(red) vs predicted(green) Values of THC for TDT samples')
plt.xlabel('d9-THC(mg/g)')
plt.ylabel('Proportion of Samples')

plt.show()
plt.close()

In [None]:
plt.figure(figsize=(12, 10))


ax1 = sns.kdeplot(Y, color="r")
sns.kdeplot(Yhat, color="b", ax=ax1)


plt.title('Actual(red) vs predicted(blue) Values of THC for TDT samples')
plt.xlabel('d9-THC(mg/g)')
plt.ylabel('Proportion of Samples')

plt.show()
plt.close()

As the KDE plots above show, the distribution of predictions from the pipeline (RobustScaler plus polynomial features) follows the actual values more closely than that of the plain linear regression (Yhat).
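
KDE plots compare distributions, not sample-by-sample errors, so a numeric comparison is a useful complement. The sketch below fits a plain linear regression and a scaled polynomial pipeline on synthetic CBN/THC-like data (the generating formula is an assumption made up for this example) and reports R² and MSE for each.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, RobustScaler

rng = np.random.default_rng(42)
cbn = rng.uniform(5, 60, size=200).reshape(-1, 1)  # synthetic CBN, mg/g
# Synthetic THC that falls with CBN along a slightly curved trend, plus noise
thc = 900 - 4 * cbn.ravel() - 0.05 * cbn.ravel() ** 2 + rng.normal(scale=10, size=200)

linear = LinearRegression().fit(cbn, thc)
poly_pipe = Pipeline([
    ("scale", RobustScaler()),
    ("polynomial", PolynomialFeatures(degree=3, include_bias=False)),
    ("model", LinearRegression()),
]).fit(cbn, thc)

# (R^2, MSE) for each model, evaluated on the data it was fitted to
scores = {}
for name, model in [("linear", linear), ("pipeline", poly_pipe)]:
    pred = model.predict(cbn)
    scores[name] = (r2_score(thc, pred), mean_squared_error(thc, pred))
    print(name, "R2 =", round(scores[name][0], 4), "MSE =", round(scores[name][1], 2))
```

Because these scores are computed on the training data, the more flexible pipeline can never score worse than the linear model here. A fair comparison needs a held-out test set or cross-validation.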

## CBG vs THC in Heavy oils (THO)

In [None]:
#df_THO = df_THO.dropna()
#df_THO = df_THO.reset_index()

In [None]:
#pearson_coef, p_value = stats.pearsonr(df_THO['CBGmg/g'], df_THO[' ∆9-THC (mg/g)'])
#print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value ) 

In [None]:
#X1 = df_THO[['CBGmg/g']]
#Y1 = df_THO[' ∆9-THC (mg/g)']
#lm2 = LinearRegression()
#lm2.fit(X1,Y1)
#Yhat_1=lm2.predict(X1)
#Yhat_1[0:10] 

In [None]:
#Input2 = [("scale", RobustScaler()), ('polynomial', PolynomialFeatures(degree=8,include_bias=False)),  ("model", LinearRegression())]

#pipe2 = Pipeline(Input2)
#pipe2

#pipe2.fit(X1, Y1)

#ypipe2=pipe2.predict(X1)
#ypipe2[0:10]

In [None]:
#plt.figure(figsize=(12, 10))


#ax1 = sns.kdeplot(Y1, color="r")
#sns.kdeplot(ypipe2, color="g", ax=ax1)


#plt.title('Actual vs predicted(green) Values of THC in THO samples')
#plt.xlabel('d9-THC(mg/g)')
#plt.ylabel('Proportion of Samples')

#plt.show()
#plt.close()

## CBG VS THC In Terpenes Fraction(TTE)

In [None]:
#df_TTE = df_TTE.dropna()
#df_TTE = df_TTE.reset_index()
#X2 = df_TTE[['CBGmg/g']]
#Y2 = df_TTE[' ∆9-THC (mg/g)']
#lm3 = LinearRegression()
#lm3.fit(X2,Y2)
#Yhat_2=lm3.predict(X2)
#Yhat_2[0:10] 

In [None]:
#Input3 = [("scale", RobustScaler()), ('polynomial', PolynomialFeatures(degree=3,include_bias=False)),  ("model", LinearRegression())]

#pipe3 = Pipeline(Input3)
#pipe3

#pipe3.fit(X2, Y2)

#ypipe3=pipe3.predict(X2)
#ypipe3[0:10]

In [None]:
#pearson_coef, p_value = stats.pearsonr(df_TTE['CBGmg/g'], df_TTE[' ∆9-THC (mg/g)'])
#print( "The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value ) 

In [None]:
#plt.figure(figsize=(12, 10))


#ax1 = sns.kdeplot(Y2, color="r")
#sns.kdeplot(ypipe3, color="g", ax=ax1)


#plt.title('Actual vs predicted(green) Values of THC in TTE samples')
#plt.xlabel('d9-THC(mg/g)')
#plt.ylabel('Proportion of Samples')

#plt.show()
#plt.close()

### Multi-linear Regression Relationships for any type/fraction of sample (all dataset)

Three independent variables - CBN, CBG, and CBC were used to predict the d9-THC values.
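
With three inputs, the number of polynomial terms grows quickly with the degree, which matters for overfitting on a dataset of a few hundred samples. The sketch below counts the generated features for several degrees using placeholder data; only the input count of three is taken from this section.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One placeholder row with three inputs (standing in for CBN, CBG, CBC)
X = np.zeros((1, 3))

counts = {}
for degree in (1, 2, 3, 4, 6):
    poly = PolynomialFeatures(degree=degree, include_bias=False).fit(X)
    counts[degree] = poly.n_output_features_
    print("degree", degree, "->", counts[degree], "features")
```

At degree 6 the three inputs expand to 83 features, so regularization (for example Ridge) or a lower degree is worth considering.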

In [None]:
df = df.dropna()
df = df.reset_index()

In [None]:
lm1 = LinearRegression()
Z = df[['CBNmg/g','CBGmg/g', 'CBCmg/g']]
lm1.fit(Z, df[' ∆9-THC (mg/g)'])

In [None]:
Yhat1=lm1.predict(Z)
Yhat1[0:10] 

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PolynomialFeatures

Input1 = [("scale", RobustScaler()), ('polynomial', PolynomialFeatures(degree=6,include_bias=False)),  ("model", LinearRegression())]

pipe1 = Pipeline(Input1)
pipe1

In [None]:
pipe1.fit(Z, df[' ∆9-THC (mg/g)'])

ypipe1=pipe1.predict(Z)
ypipe1[0:5]

In [None]:
plt.figure(figsize=(12, 10))


ax1 = sns.kdeplot(df[' ∆9-THC (mg/g)'], color="r")
sns.kdeplot(ypipe1, color="b", ax=ax1)


plt.title('Actual vs Predicted(blue) Values for THC')
plt.xlabel('THC values')
plt.ylabel('Proportion of samples')

plt.show()
plt.close()

# Model Evaluation

### R-squared(r2) and Mean Square Error (MSE)

#### r2 and MSE of CBN vs THC in distillates (TDT)

In [None]:
lm.fit(X, Y)
print('The R-square is: ', lm.score(X, Y))

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y, Yhat)
print('The mean square error is: ', mse)


In [None]:
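# Score the fitted pipeline directly; refitting lm on the pipeline's own predictions would not measure fit to the data
print('The R-square is: ', pipe.score(X, Y))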

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(Y, ypipe)
print('The mean square error is: ', mse)

####  r2 and mse for Multi-linear Regression Relationship for any type of sample

In [None]:
Yhat1=lm1.predict(Z)
print('The R-square is: ', lm1.score(Z, df[' ∆9-THC (mg/g)']))

mse = mean_squared_error(df[' ∆9-THC (mg/g)'], Yhat1)
print('The mean square error is: ', mse)

## Training and Testing

#### Training and testing for all types of samples(Z)

In [None]:
y_data = df[' ∆9-THC (mg/g)']

In [None]:
x_data = df.drop(' ∆9-THC (mg/g)', axis=1)

In [None]:
x_data.shape

In [None]:
from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.30, random_state=1)


print("number of test samples :", x_test.shape[0])
print("number of training samples:",x_train.shape[0])

### Ridge Regression of multilinear relationship

In [None]:
pr=PolynomialFeatures(degree=4)
x_train_pr=pr.fit_transform(x_train[['CBNmg/g','CBGmg/g', 'CBCmg/g']])
# Reuse the transformer fitted on the training set; only transform the test set
x_test_pr=pr.transform(x_test[['CBNmg/g','CBGmg/g', 'CBCmg/g']])

In [None]:
from sklearn.linear_model import Ridge
RidgeModel=Ridge(alpha=10)
RidgeModel.fit(x_train_pr, y_train)

In [None]:
RidgeModel.score(x_test_pr, y_test)


In [None]:
yhat_R = RidgeModel.predict(x_test_pr)

print('predicted:', yhat_R[0:10])
print('test set :', y_test[0:10].values)

#### Best Hyperparameter using GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

parameters1= [{'alpha': [0.0001,0.001,0.1,1, 10, 100, 1000]}]
parameters1

In [None]:
RR=Ridge()
RR

In [None]:
Grid1 = GridSearchCV(RR, parameters1,cv=7)

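# Fit the search on the training split only, so the test-set score below is not contaminated by test data
Grid1.fit(x_train[['CBNmg/g','CBGmg/g', 'CBCmg/g']], y_train)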

In [None]:
BestRR=Grid1.best_estimator_
BestRR

In [None]:
BestRR.score(x_test[['CBNmg/g','CBGmg/g', 'CBCmg/g']], y_test)

### Hold-out validation of CBN vs THC models in TDT samples

In [None]:
x_data1 = df_TDT.drop(' ∆9-THC (mg/g)', axis=1)
y_data1 = df_TDT[' ∆9-THC (mg/g)']

In [None]:
df_TDT.shape

In [None]:
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data1, y_data1, test_size=0.30, random_state=7)


print("number of test samples :", x_test1.shape[0])
print("number of training samples:",x_train1.shape[0])

#### Polynomial features are applied to the training and test sets so the linear model can fit the curvature in the data
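
A compact way to keep the transformation consistent between the training and test sets is to wrap it in a pipeline, so the transformer is fitted on the training data only and then reused for the test data. The sketch below shows this with a 70/30 split on synthetic data. The generating formula and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
cbn = rng.uniform(5, 60, size=300).reshape(-1, 1)  # synthetic CBN, mg/g
thc = 900 - 4 * cbn.ravel() - 0.05 * cbn.ravel() ** 2 + rng.normal(scale=10, size=300)

# 70% of the samples for training, 30% held out for testing
x_tr, x_te, y_tr, y_te = train_test_split(cbn, thc, test_size=0.30, random_state=7)

# fit() learns the transformation and the regression on the training data only;
# score() applies the same fitted steps to the test data
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x_tr, y_tr)
test_r2 = model.score(x_te, y_te)
print("test R2:", round(test_r2, 4))
```

For `PolynomialFeatures` the fitted state is minimal, but the same pattern prevents leakage when a scaler or other data-dependent step is added.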

In [None]:
pr = PolynomialFeatures(degree=3)
x_train_pr1 = pr.fit_transform(x_train1[['CBNmg/g']])
# Reuse the transformer fitted on the training set; only transform the test set
x_test_pr1 = pr.transform(x_test1[['CBNmg/g']])
pr

In [None]:
poly = LinearRegression()
poly.fit(x_train_pr1, y_train1)

In [None]:
yhat_pr = poly.predict(x_test_pr1)

In [None]:
print("Predicted values:", yhat_pr[0:10])
print("True values:", y_test1[0:10].values)

In [None]:
plt.figure(figsize=(12, 10))


ax1 = sns.kdeplot(y_test1, label="actual", color="r") # Actual values of test set
sns.kdeplot(yhat_pr, color="b", label="predicted (test set)", ax=ax1) # Predicted values of test set


plt.title('Actual vs Predicted Values THC for test set')
plt.xlabel('THC values')
plt.ylabel('Proportion of samples')
plt.legend()

plt.show()
plt.close()

In [None]:
# R-squared of actual versus predicted potency on the training set
from sklearn.metrics import r2_score
round(r2_score(y_train1, poly.predict(x_train_pr1)),4)

In [None]:
# R-squared of actual versus predicted potency on the test set
round(r2_score(y_test1, yhat_pr),4)

In [None]:
# or 
round(poly.score(x_test_pr1, y_test1),4)

In [None]:
print("Coefficients:", poly.coef_)
print("Intercept:", poly.intercept_)

### Predictions on Unknown Samples

In [None]:

cbn_value = 10

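# Reuse the already-fitted polynomial transformer; keep the column name so it matches the training data
pred_input = pd.DataFrame({'CBNmg/g': [cbn_value]})
potency = poly.predict(pr.transform(pred_input))
print("The predicted potency for a CBN value of", cbn_value, "mg/g is:", round(potency[0], 2), "mg/g")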

In [None]:
#from sklearn.model_selection import cross_val_score
#Rcross = cross_val_score(pipe, x_data1[['CBNmg/g']], y_data1, cv=8)
#Rcross

In [None]:
print("Predicted values (linear regression, all data):", Yhat[0:10])
print("Predicted values (polynomial model, test set):", yhat_pr[0:10])
print("True values (test set):", y_test1[0:10].values)

In [None]:
plt.figure(figsize=(12, 10))


ax1 = sns.kdeplot(df_TDT[' ∆9-THC (mg/g)'], label="actual", color="r") # True whole dataset distribution
sns.kdeplot(yhat_pr, color="b", label="predicted (test set)", ax=ax1) # Predictions on the held-out test set after the train/test split
sns.kdeplot(Yhat, color="g", label="prediction_whole_LR", ax=ax1) # Predictions on the whole dataset (no split) with linear regression only
sns.kdeplot(ypipe, color="purple", label="prediction_whole_pipeline", ax=ax1)  # Predictions on the whole dataset (no split) using the pipeline (RobustScaler and polynomial features)


plt.title('Actual vs Predicted Values for THC')
plt.xlabel('THC values')
plt.ylabel('Proportion of samples')
plt.legend()

plt.show()
plt.close()

## Conclusions

This project is a step toward addressing THC inflation in the cannabis industry. The predictive model estimates THC values in cannabis distillates from CBN values, with predictions that fall within the margin of error, and it can be used to flag reported THC values that deviate from what the model expects. That makes it useful both as a guard against inflated THC data and as a quality assurance check for testing laboratories, which in turn supports consumer trust and better industry practice.<br>

As the cannabis industry continues to evolve, accurate potency testing remains essential. A model of this kind gives producers, consumers, and regulatory bodies an additional, data-based reference point when judging reported THC values, and so supports trust, accountability, and the continued growth of the sector.








In [None]:
df.to_csv('potency_processed.csv')
