# Final Project - Week 6 Submission
### Data Analysis
### Due: Tuesday, December 3, 2024 at 11:59 PM

Building upon your Week 5 cleaned dataset, this week focuses on conducting data analysis of your **cleaned dataset** to uncover patterns, relationships, and key insights that address your research questions. 

#### Submission Requirements:
- Submit a PDF or Jupyter notebook (.ipynb) containing:
  - All code with outputs visible
  - Clear documentation for data analysis (Part 4)
    - You can use markdown in Jupyter notebook, or write your final summary in Word/PDF. If you choose Word/PDF, please still submit your notebook/PDF containing the code)<br><br>

    
  
- For individual submission: Include your name in either the filename or within the notebook content
- For team submission: 
  1. Include all team members' names in the notebook content
  2. For team members from different sessions, clearly indicate their session numbers
  3. All team members are required to submit a copy of the assignment

>**Note**: I provide some example codes for some data analysis tasks. Feel free to modify these codes to fit your specific dataset and research questions, or you can write your own code that better suits your analysis needs.

In [1]:
import os
import pandas as pd
import numpy as np

os.chdir('/Users/nkohei/Workspace/McDaniel-Repository/522/final project')
os.getcwd()

df = pd.read_csv('cleaned_dataset.csv')

df['all_tooth_exist'] = np.where(
    (df['num_incisors'] == 4) & 
    (df['num_canines'] == 2) & 
    (df['num_premolars'] == 4) & 
    (df['num_molars'] == 4), 
    'all_detected', 
    'missing_or_extracted'
)

df['arch_discrepancy_severity'] = pd.cut(
    df['arch_length_discrepancy'],  # Column to categorize
    bins=4,  # Number of equal-width bins
    labels=['low', 'medium', 'high', 'very_high']  # Custom labels
)



<br>

## Part 1: Descriptive Statistics
Using your cleaned dataset from the previous week, conduct a comprehensive descriptive statistical analysis to summarize the key characteristics of the data.
>**Note**: If you have already completed this analysis in a previous step, there is no need to repeat it. However, if your cleaned dataset differs from the dataset before cleaning, please use your final dataset for this part to identify any new findings or insights.

1. **Central Tendency**

- Measures like mean, median, and mode
  - *Purpose*: Reveals the typical or representative values in your variables, helping identify patterns and trends relevant to your research questions. For example, if studying house prices, these measures indicate whether average house prices in one area are higher than others, offering insights into housing market dynamics and regional differences.

2. **Dispersion**

- Measures like standard deviation, variance, range, and IQR (Interquartile Range)
  - *Purpose*: Shows data variability and distribution shape, helping identify significant variations in your key variables. For example, high dispersion in test scores might indicate educational inequality, while low dispersion in prices might suggest market stability.

In [2]:
df.describe()

Unnamed: 0,detected_arch_length,simulated_arch_length,arch_length_discrepancy,num_incisors,num_canines,num_premolars,num_molars,avg_confidence_incisor,avg_confidence_canine,avg_confidence_premolar,avg_confidence_molar
count,4929.0,4929.0,4929.0,4929.0,4929.0,4929.0,4929.0,4929.0,4929.0,4929.0,4929.0
mean,2012.522671,1910.30128,102.221391,4.007507,2.000609,3.93589,4.342057,0.890299,0.892062,0.894154,0.902066
std,125.853247,118.478953,79.634064,0.178332,0.084273,0.373031,0.735699,0.013824,0.026383,0.019228,0.021852
min,1500.511894,1181.883742,-340.88745,2.0,0.0,2.0,1.0,0.737696,0.0,0.68274,0.612709
25%,1935.920903,1841.472148,60.088269,4.0,2.0,4.0,4.0,0.88915,0.89049,0.8931,0.90214
50%,2011.437658,1912.98549,87.406464,4.0,2.0,4.0,4.0,0.892354,0.895022,0.897191,0.907191
75%,2088.885411,1979.329743,121.486154,4.0,2.0,4.0,5.0,0.895338,0.899111,0.90113,0.911424
max,2546.80391,2449.704441,1204.224423,6.0,3.0,7.0,7.0,0.908154,0.917876,0.925435,0.931001


In [3]:
def calculate_range(series):
    return series.max() - series.min()

def calculate_iqr(series):
    return series.quantile(0.75) - series.quantile(0.25)

numeric_columns = df.select_dtypes(include='number')

aggregation_results = numeric_columns.agg(['var', 'std', calculate_range, calculate_iqr])
aggregation_results



Unnamed: 0,detected_arch_length,simulated_arch_length,arch_length_discrepancy,num_incisors,num_canines,num_premolars,num_molars,avg_confidence_incisor,avg_confidence_canine,avg_confidence_premolar,avg_confidence_molar
var,15839.039705,14037.262368,6341.584173,0.031802,0.007102,0.139152,0.541252,0.000191,0.000696,0.00037,0.000477
std,125.853247,118.478953,79.634064,0.178332,0.084273,0.373031,0.735699,0.013824,0.026383,0.019228,0.021852
calculate_range,1046.292016,1267.820699,1545.111873,4.0,3.0,5.0,6.0,0.170458,0.917876,0.242695,0.318292
calculate_iqr,152.964508,137.857595,61.397885,0.0,0.0,0.0,1.0,0.006188,0.008621,0.00803,0.009284


1. Central Tendency
- Mean aggregation: Detected_arch_length has longer than simulated_arch_length. Given Orthodontics potential patients, this discrepancy is expected
- Min aggregation: Given negative value at arch_length_discrepancy, some patient has much shorter detected_arch_length.


2. Dispersion
- Std: Because arch simulation method employs smoothed curve fitting, it has less standard deviation than detected arch length.
- Range x IQR : detected arch length has longer IQR than simulated_arch_length, but not same result from Range. It may indicate the some outliner still exist in either length value.




----

2. **Categorical Analysis**
   
For categorical variables:

- Calculate and interpret frequency distributions <br><br>
- If applicable and needed, create cross-tabulations for relevant variable pairs
     - Purpose: Displays the interaction between two categorical variables, helping to uncover relationships between categories.
     - Example: Analyzing a cross-tabulation of gender and product purchase might show that females are more likely to purchase a particular product than males, providing valuable input for marketing strategies.<br><br>
- If applicable and needed, conduct Chi-square tests of independence.
     - Purpose: Tests whether there is a statistically significant association between two categorical variables.
     - Example: For instance, a Chi-square test might reveal that gender significantly impacts customer feedback preferences (p-value < 0.05), offering key insights for customer segmentation strategies.

In [4]:
# Example code for categorical analysis
# Analyze relationship between employee gender and department distribution

# Frequency analysis
print(df['label'].value_counts())
print(df['label'].value_counts(normalize=True))

# Cross-tabulation analysis
dept_gender_tab = pd.crosstab(
   df['label'],
   df['all_tooth_exist'],
   margins=True
)

# Chi-square test
from scipy.stats import chi2_contingency
contingency_table = pd.crosstab(df['label'], df['all_tooth_exist'])


chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print("---")
print(f'Chi-square test statistic: {chi2}')
print(f'p-value: {p_value}')
print(f'Degrees of freedom: {dof}')
print('Expected frequencies:')
print(expected)

label
basic          3168
pro             624
regrettable     569
invisalign      568
Name: count, dtype: int64
label
basic          0.642727
pro            0.126598
regrettable    0.115439
invisalign     0.115236
Name: proportion, dtype: float64
---
Chi-square test statistic: 44.997469089591235
p-value: 9.264168738865696e-10
Degrees of freedom: 3
Expected frequencies:
[[2084.36275107 1083.63724893]
 [ 373.71150335  194.28849665]
 [ 410.55629945  213.44370055]
 [ 374.36944614  194.63055386]]


>**Optional**: If your dataset is suitable, we can also practice the Data Wrangling concepts we learned this week. Below are some general examples, but please adapt these codes according to your dataset or practice with other codes as needed.

In [5]:
# Example 1: Multi-level indexing
df_multi = df.set_index(
    ['label', 'all_tooth_exist']
).sort_index()

# Example 2: Index names setting
df_multi.index.names = ['label', 'all_tooth_exist']

df_multi

Unnamed: 0_level_0,Unnamed: 1_level_0,detected_arch_length,simulated_arch_length,arch_length_discrepancy,num_incisors,num_canines,num_premolars,num_molars,avg_confidence_incisor,avg_confidence_canine,avg_confidence_premolar,avg_confidence_molar,detected_arch_length_outliers_flg,arch_discrepancy_severity
label,all_tooth_exist,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
basic,all_detected,1885.302329,1816.008383,69.293946,4,2,4,4,0.887684,0.894043,0.906347,0.913622,False,medium
basic,all_detected,1971.601911,1948.123640,23.478271,4,2,4,4,0.895110,0.900412,0.900419,0.907105,False,low
basic,all_detected,1682.730307,1660.049287,22.681020,4,2,4,4,0.894340,0.890298,0.895315,0.914344,False,low
basic,all_detected,1889.994699,1797.104997,92.889702,4,2,4,4,0.894349,0.892443,0.885249,0.908655,False,medium
basic,all_detected,1803.870817,1720.724704,83.146113,4,2,4,4,0.891748,0.898954,0.898250,0.913284,False,medium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
regrettable,missing_or_extracted,1885.352280,1792.052169,93.300111,3,2,4,4,0.889672,0.899963,0.894997,0.914774,False,medium
regrettable,missing_or_extracted,1876.981307,1787.897487,89.083820,4,2,4,3,0.873379,0.884697,0.899809,0.908093,False,medium
regrettable,missing_or_extracted,2365.197006,2218.604058,146.592947,5,2,4,6,0.879106,0.892412,0.885477,0.826385,False,medium
regrettable,missing_or_extracted,2382.072374,2187.436244,194.636130,4,2,4,6,0.887715,0.893488,0.895949,0.893946,False,medium


In [6]:
# Example 3: Pivot table transformation
pivot_result = pd.pivot_table(
    df,
    values='simulated_arch_length',
    index=['label', 'all_tooth_exist'],
    columns='arch_discrepancy_severity'
)

pivot_result

  pivot_result = pd.pivot_table(


Unnamed: 0_level_0,arch_discrepancy_severity,low,medium,high,very_high
label,all_tooth_exist,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
basic,all_detected,1907.52943,1896.107729,1567.233475,
basic,missing_or_extracted,1973.022381,1939.921014,1723.179717,
invisalign,all_detected,1943.234279,1893.780442,,
invisalign,missing_or_extracted,2009.875878,1960.830164,1550.431395,
pro,all_detected,1907.113823,1897.033699,,
pro,missing_or_extracted,2030.177174,1936.812662,1714.977095,
regrettable,all_detected,1913.685486,1875.341052,1688.800034,
regrettable,missing_or_extracted,1961.329248,1898.286401,1729.345967,1286.472635


In [7]:
# Example 4: Wide to long format
df_long = pd.melt(
    df,
    id_vars=['label'],
    value_vars=['detected_arch_length', 'simulated_arch_length'],
)

df_long

Unnamed: 0,label,variable,value
0,invisalign,detected_arch_length,1971.032470
1,basic,detected_arch_length,2083.646892
2,pro,detected_arch_length,1816.608868
3,basic,detected_arch_length,1885.302329
4,basic,detected_arch_length,1971.601911
...,...,...,...
9853,basic,simulated_arch_length,1969.037121
9854,basic,simulated_arch_length,1853.592013
9855,basic,simulated_arch_length,1851.329048
9856,basic,simulated_arch_length,1715.296808


<br>

## Part 2: Inferential Statistics
Choose appropriate statistical tests based on your research questions and data types. Some common tests include:

>**Note**: Below are some models we covered in previous statistics courses. You don’t need to use all of them — just pick the ones that fit your data and help answer your research questions. If you’d like to practice, you can try all of them. If your data or research questions are not suitable for all these models, you can skip this section and focus solely on completing Part 1.


1. **Correlation Analysis**

- For continuous variables
- Calculate correlation coefficients
- Test for statistical significance

In [8]:
df_numeric = df.select_dtypes(include='number')
df_numeric.corr()

Unnamed: 0,detected_arch_length,simulated_arch_length,arch_length_discrepancy,num_incisors,num_canines,num_premolars,num_molars,avg_confidence_incisor,avg_confidence_canine,avg_confidence_premolar,avg_confidence_molar
detected_arch_length,1.0,0.789175,0.406266,0.116979,0.052578,0.133022,0.558208,-0.118154,-0.041614,0.019533,-0.198164
simulated_arch_length,0.789175,1.0,-0.240584,0.062838,-0.009482,0.113416,0.443147,-0.029606,0.016957,0.09549,-0.14572
arch_length_discrepancy,0.406266,-0.240584,1.0,0.091384,0.0972,0.041489,0.222879,-0.142683,-0.090995,-0.111199,-0.096377
num_incisors,0.116979,0.062838,0.091384,1.0,0.053706,-0.04157,0.036105,-0.591,-0.153589,-0.022295,-0.089574
num_canines,0.052578,-0.009482,0.0972,0.053706,1.0,-0.011669,-0.003359,-0.099897,0.049661,-0.029708,-0.000413
num_premolars,0.133022,0.113416,0.041489,-0.04157,-0.011669,1.0,-0.005849,0.01875,0.006089,-0.011493,0.06619
num_molars,0.558208,0.443147,0.222879,0.036105,-0.003359,-0.005849,1.0,-0.042276,-0.015992,0.025455,-0.245393
avg_confidence_incisor,-0.118154,-0.029606,-0.142683,-0.591,-0.099897,0.01875,-0.042276,1.0,0.216836,0.064242,0.115937
avg_confidence_canine,-0.041614,0.016957,-0.090995,-0.153589,0.049661,0.006089,-0.015992,0.216836,1.0,0.191332,0.023156
avg_confidence_premolar,0.019533,0.09549,-0.111199,-0.022295,-0.029708,-0.011493,0.025455,0.064242,0.191332,1.0,0.327301


2. **Group Comparisons**   
    Choose appropriate tests based on your data:

- T-tests (comparing two groups)
   - Example: A t-test can help determine if the average salary of employees in two departments (e.g., Sales vs. Marketing) is significantly different.<br><br>
- ANOVA (comparing multiple groups)
   - Example: ANOVA can test whether average house prices vary significantly between different cities (e.g., City A, City B, City C).<br><br>
- Optional: Non-parametric tests: used when assumptions of parametric tests are not met, such as the Mann-Whitney U test or Kruskal-Wallis test.

In [9]:
from scipy import stats

# T-test: Compare arch lengths between basic and invisalign cases
label_ttest_basic_pro = stats.ttest_ind(
    df[df['label'] == 'basic']['arch_length_discrepancy'],
    df[df['label'] == 'pro']['arch_length_discrepancy']
)

print("T-test: Compare arch lengths between basic and pro cases")
print(label_ttest_basic_pro)

label_ttest_basic_invisalign = stats.ttest_ind(
    df[df['label'] == 'basic']['arch_length_discrepancy'],
    df[df['label'] == 'invisalign']['arch_length_discrepancy']
)

print("T-test: Compare arch lengths between basic and invisalign cases")
print(label_ttest_basic_invisalign)

label_ttest_basic_regrettable = stats.ttest_ind(
    df[df['label'] == 'basic']['arch_length_discrepancy'],
    df[df['label'] == 'regrettable']['arch_length_discrepancy']
)

print("T-test: Compare arch lengths between basic and regrettable cases")
print(label_ttest_basic_invisalign)

T-test: Compare arch lengths between basic and pro cases
TtestResult(statistic=np.float64(-0.565350489134896), pvalue=np.float64(0.5718688580157401), df=np.float64(3790.0))
T-test: Compare arch lengths between basic and invisalign cases
TtestResult(statistic=np.float64(-8.902384104924641), pvalue=np.float64(8.361599635662965e-19), df=np.float64(3734.0))
T-test: Compare arch lengths between basic and regrettable cases
TtestResult(statistic=np.float64(-8.902384104924641), pvalue=np.float64(8.361599635662965e-19), df=np.float64(3734.0))


In [10]:
# ANOVA: Compare arch length discrepancies across different labels
labels = ['basic', 'pro', 'invisalign', 'regrettable']
discrepancies_by_label = [df[df['label'] == label]['arch_length_discrepancy'] for label in labels]
label_anova = stats.f_oneway(*discrepancies_by_label)
label_anova
#F_onewayResult(statistic=np.float64(107.55344653758648), pvalue=np.float64(1.899824063528475e-67))


F_onewayResult(statistic=np.float64(107.55344653758648), pvalue=np.float64(1.899824063528475e-67))

3. **Regression Analysis**

- Simple Linear Regression: To examine the relationship between a continuous dependent variable and a single independent variable.
   - Example: Predicting house prices (dependent variable) based on the size of the house (independent variable).<br><br>
- Multiple Linear Regression: To examine the relationship between a continuous dependent variable and multiple independent variables.
   - Example: Predicting house prices (dependent variable) based on multiple factors like the size of the house, number of bedrooms, and location (independent variables).<br><br>
- Logistic Regression: If the dependent variable is binary (e.g., yes/no, success/failure), logistic regression can be used.
   - Example: Predicting whether a customer will buy a product (binary dependent variable: yes/no) based on their age and income (independent variables).

In [11]:
# Example 1: Simple Linear Regression
# Study relationship between house price (dependent) and square footage (independent)

import statsmodels.api as sm
from sklearn.preprocessing import LabelEncoder

# Prepare data
X = df['arch_length_discrepancy'] 
le = LabelEncoder()
y = le.fit_transform(df['label'])
X = sm.add_constant(X)  # Add constant term

# Fit model
simple_model = sm.OLS(y, X).fit()

# View detailed results
print("\nSimple Linear Regression Results:")
print(simple_model.summary())


Simple Linear Regression Results:
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.039
Model:                            OLS   Adj. R-squared:                  0.039
Method:                 Least Squares   F-statistic:                     199.6
Date:                Tue, 03 Dec 2024   Prob (F-statistic):           1.91e-44
Time:                        21:03:23   Log-Likelihood:                -7239.9
No. Observations:                4929   AIC:                         1.448e+04
Df Residuals:                    4927   BIC:                         1.450e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------

In [12]:
# Example 2: Multiple Linear Regression
X__numeric = df.select_dtypes(include='number')
X_multi = sm.add_constant(X__numeric)
y_multi = le.fit_transform(df['label'])

multi_model = sm.OLS(y_multi, X_multi).fit()

print("\nMultiple Linear Regression Results:")
print(multi_model.summary())


Multiple Linear Regression Results:
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.063
Model:                            OLS   Adj. R-squared:                  0.061
Method:                 Least Squares   F-statistic:                     30.05
Date:                Tue, 03 Dec 2024   Prob (F-statistic):           4.76e-62
Time:                        21:03:23   Log-Likelihood:                -7177.4
No. Observations:                4929   AIC:                         1.438e+04
Df Residuals:                    4917   BIC:                         1.446e+04
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------

<br>

In [13]:

# Example 3: Logistic Regression
# Predict binary outcome (e.g., whether customer will buy a product)
#build for basic prediction model
from statsmodels.formula.api import logit
from statsmodels.discrete.discrete_model import MNLogit

df['label'] = df['label'].astype('category')

X = df[['arch_length_discrepancy', 'simulated_arch_length', 'detected_arch_length', 
        'num_incisors', 'num_canines', 'num_premolars', 'num_molars']]

X = sm.add_constant(X)

y = df['label']

mnlogit_model = sm.MNLogit(y, X).fit()

# Print the summary
print("\nMultinomial Logistic Regression Results:")
print(mnlogit_model.summary())

         Current function value: 1.009534
         Iterations: 35

Multinomial Logistic Regression Results:
                          MNLogit Regression Results                          
Dep. Variable:                  label   No. Observations:                 4929
Model:                        MNLogit   Df Residuals:                     4905
Method:                           MLE   Df Model:                           21
Date:                Tue, 03 Dec 2024   Pseudo R-squ.:                 0.03300
Time:                        21:03:24   Log-Likelihood:                -4976.0
converged:                      False   LL-Null:                       -5145.8
Covariance Type:            nonrobust   LLR p-value:                 2.543e-59
       label=invisalign       coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                      -3.8416      1.716     -2.238      0.025      -7.

  bse = np.sqrt(np.diag(self.cov_params()))


## Optional: Part 3: Advanced Models 
Some advanced models, such as machine learning models (e.g., Random Forest, Support Vector Machine) and time series analysis models, will be covered in future courses. If you are already familiar with these techniques and wish to practice, feel free to explore them. However, this is not required for the current assignment.

In [1]:
# Example 4: Random Forest with Cross Validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_validate, KFold

X = df[['arch_length_discrepancy', 'num_incisors', 'num_canines', 'num_premolars', 'num_molars']]
y = (df['label'] == 'invisalign').astype(int)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

cv = KFold(n_splits=5, shuffle=True, random_state=42)

scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1'
}

cv_results = cross_validate(rf_model, X, y, cv=cv, scoring=scoring)

print("Cross Validation Results:")
print(f"Accuracy: {cv_results['test_accuracy'].mean():.4f} (+/- {cv_results['test_accuracy'].std() * 2:.4f})")
print(f"Precision: {cv_results['test_precision'].mean():.4f} (+/- {cv_results['test_precision'].std() * 2:.4f})")
print(f"Recall: {cv_results['test_recall'].mean():.4f} (+/- {cv_results['test_recall'].std() * 2:.4f})")
print(f"F1 Score: {cv_results['test_f1'].mean():.4f} (+/- {cv_results['test_f1'].std() * 2:.4f})")

rf_model.fit(X, y)
feature_importance = pd.DataFrame(rf_model.feature_importances_, 
                            index=X.columns, 
                            columns=['importance']).sort_values('importance', ascending=False)
print("\nFeature Importance:")
print(feature_importance)


NameError: name 'df' is not defined

## <font color="Purple"> Part 4: Documentation </font>

Provide a clear and concise summary of your data analysis, including the interpretation of the results. This should explain the key insights gained from the analyses and their relevance to your research questions.

Requirements:

- Write at least one paragraph summarizing your findings from the data analysis.
- Highlight significant results and their interpretations.
- Optional: Discuss any limitations or challenges encountered during the analysis.
  
Example Structure:

- What you found: Summarize key results (e.g., correlations, group differences, or significant relationships).
- Why it matters: Explain how these findings address your research questions or contribute to your understanding of the data.
- Optional: What’s next: suggest any follow-up steps or analyses based on your findings.
  
Don’t forget to complete the documentation, as it helps consolidate your work and ensures clear communication of your results.

# Summary Findings

## Dataset Overview
The dataset was found to be imbalanced, with the dominant label, "Basic," accounting for 63% of total observations. This category corresponds to treatments primarily addressing front-tooth corrections, which are typically less complex and the most cost-effective. The remaining observations were distributed almost equally across the other categories. Recognizing this imbalance, we prioritized insights that align with industrial standards and focused on arch length discrepancy, a critical feature representing the difference between the ideal and actual dental arch lengths.

## Statistical Insights
To examine the relationship between treatment plans and arch length discrepancy, we conducted statistical tests:

- ANOVA: Results (p=1.90e-67) confirmed highly significant differences between treatment groups, underscoring the importance of arch length discrepancy in categorizing treatments.
- T-tests: Pairwise comparisons revealed: No statistically significant difference between the "Basic" and "Pro" treatment plans (p=0.572), indicating similarities in detected and simulated arch lengths.
Significant differences between "Basic" and other treatment categories ("Invisalign" and "Regrettable"), demonstrating the utility of arch length discrepancy as a distinguishing feature in these cases.
These findings suggest that while "Pro" treatment involves molars and premolars (likely for chewing or functional issues), patients may prioritize cost-effective options like "Basic" treatments, influenced by affordability or other factors.

## Model Development and Evaluation
Given the multi-category nature of the target variable, we framed the problem as a multi-class classification task. Initial modeling efforts included:

OLS Regression and Logistic Regression: Despite extensive exploration of numerical variables, these models showed limited explainability, as indicated by low R-squared values.
Random Forest Classifier: Recognizing the limitations of linear approaches, we implemented a Random Forest model, which demonstrated significantly stronger predictive performance:
- Accuracy: ~0.89
- Precision: ~0.88
- Recall: ~0.87
- F1 Score: ~0.88
These metrics highlight the model’s robustness and its ability to handle the dataset’s inherent class imbalance. However, the imbalanced nature of the data necessitates cautious interpretation of these metrics, particularly for minority classes.

## Key Takeaways
Arch Length Discrepancy: A pivotal feature for distinguishing between treatment categories, particularly "Basic" vs. "Invisalign" or "Regrettable."
Linear vs. Non-Linear Models: While linear models failed to capture the complexity of the problem, Random Forest emerged as a powerful alternative, suggesting non-linear relationships in the data.
Practical Implications: Patients may opt for cost-effective "Basic" treatments over "Pro" due to financial considerations or personal preferences, even when discrepancies are apparent.
## Next Steps
Class Imbalance Handling: Apply techniques such as oversampling, undersampling, or cost-sensitive learning to improve performance on minority classes.
Explainability and Deployment: Investigate interpretability tools (e.g., SHAP values) to understand feature importance, which could support clinical decision-making and enhance patient communication.

In [16]:
df.to_csv('final_dataset.csv', index=False)