### Order in which the code is to be run

Dataset_preprocessing.ipynb ---> EDA.ipynb ---> Feature_Engineering.ipynb ---> models.ipynb ---> neural-network.ipynb

Running this code creates many dataset files. They could not be submitted because of the 50 MB zip file size limit.

### Step : Importing Libraries
This section imports the necessary Python libraries:
- `pandas` for data manipulation and analysis
- `scikit-learn` for machine learning utilities like dataset splitting
- `textblob` for sentiment analysis
- `os` for handling file paths

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from textblob import TextBlob
import os


In [2]:
file_path = 'Dataset/usecase_4_.xlsx'
df = pd.read_excel(file_path)


In [3]:
print("Initial DataFrame:")
print(df.head())


Initial DataFrame:
    NCT Number                                        Study Title  \
0  NCT00900809  QUILT-3.018: Neukoplast™ (NK-92) for the Tre...   
1  NCT01113515  Clinical Investigation of Galnobax® for the T...   
2  NCT01288573  A Combined Study in Pediatric Cancer Patients ...   
3  NCT01336660  A Trial of Equine F (ab')2 Antivenom for Treat...   
4  NCT01376167  Ph 2B/3 Tafenoquine (TFQ) Study in Prevention ...   

                                      Study URL Study Status  \
0  https://clinicaltrials.gov/study/NCT00900809    COMPLETED   
1  https://clinicaltrials.gov/study/NCT01113515    COMPLETED   
2  https://clinicaltrials.gov/study/NCT01288573    COMPLETED   
3  https://clinicaltrials.gov/study/NCT01336660    COMPLETED   
4  https://clinicaltrials.gov/study/NCT01376167    COMPLETED   

                                       Brief Summary Study Results  \
0  NK cells from patients with malignant diseases...            NO   
1  The purpose of this study is to determ

### Step : Checking Dataset Dimensions
Displays the number of rows and columns in the dataset to understand its size.

In [4]:
df.shape

(20676, 29)

### Step : Identifying Column Types
Classifies each `object`-typed column into one of the following categories, checked in this order:
- **Categorical columns**: columns with fewer than 50 unique values.
- **Date columns**: remaining columns whose name contains `date` or `time`.
- **Text columns**: all other `object` columns, treated as free text.

In [5]:
categorical_columns = []
text_columns = []
date_columns = []

for column in df.columns:
    if df[column].dtype == 'object':
        unique_values = df[column].nunique()
        if unique_values < 50: 
            categorical_columns.append(column)
        elif 'date' in column.lower() or 'time' in column.lower():  
            date_columns.append(column)
        else:
            text_columns.append(column)


In [6]:
print("\nIdentified column types:")
print(f"Categorical columns: {categorical_columns}")
print(f"Text columns: {text_columns}")
print(f"Date columns: {date_columns}")



Identified column types:
Categorical columns: ['Study Status', 'Study Results', 'Sex', 'Age', 'Phases', 'Funder Type', 'Study Type']
Text columns: ['NCT Number', 'Study Title', 'Study URL', 'Brief Summary', 'Conditions', 'Interventions', 'Primary Outcome Measures', 'Secondary Outcome Measures', 'Other Outcome Measures', 'Sponsor', 'Collaborators', 'Study Design', 'Other IDs', 'Locations']
Date columns: ['Primary Completion Date']
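The classification rule can be sketched as a small reusable function and tried on a toy frame. The function name `classify_object_columns`, the `max_categories` parameter, and the toy data are assumptions made for illustration; the notebook itself uses the inline loop with a fixed threshold of 50. Because the uniqueness check comes first, a date-named column with few distinct values would land in the categorical group.

```python
import pandas as pd

def classify_object_columns(frame, max_categories=50):
    """Split object-typed columns into categorical, date, and text groups."""
    groups = {"categorical": [], "date": [], "text": []}
    for name in frame.columns:
        if frame[name].dtype != "object":
            continue  # numeric and datetime columns are left unclassified
        if frame[name].nunique() < max_categories:
            groups["categorical"].append(name)
        elif "date" in name.lower() or "time" in name.lower():
            groups["date"].append(name)
        else:
            groups["text"].append(name)
    return groups

# Toy data (hypothetical values); dtype=object mirrors what read_excel
# typically yields for string columns.
toy = pd.DataFrame({
    "Sex": pd.Series(["ALL", "FEMALE", "MALE", "ALL"], dtype=object),
    "Study Title": pd.Series(["Trial A", "Trial B", "Trial C", "Trial D"], dtype=object),
    "Primary Completion Date": pd.Series(
        ["2015-06-02", "2015-06-27", "2017-05-09", "2018-11-01"], dtype=object
    ),
    "Enrollment": [7, 44, 46, 56],
})

# A threshold of 4 is used only so the tiny example exercises all three branches.
toy_groups = classify_object_columns(toy, max_categories=4)
print(toy_groups)
```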


In [7]:
print("\nEncoding categorical columns...")
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
print("Categorical columns encoded.")


Encoding categorical columns...
Categorical columns encoded.
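As a small illustration of what `drop_first=True` does, the sketch below encodes a toy `Sex` column (the values are made up for the example). The first category in sorted order is dropped, so a row belonging to it is represented by zeros in all remaining indicator columns.

```python
import pandas as pd

# Toy column with three categories (hypothetical values).
toy = pd.DataFrame({"Sex": ["ALL", "FEMALE", "MALE", "ALL"]})

# drop_first=True removes the indicator for the first category ("ALL"),
# which avoids one redundant, perfectly collinear column.
encoded = pd.get_dummies(toy, columns=["Sex"], drop_first=True)
encoded_columns = list(encoded.columns)
print(encoded_columns)
print(encoded)
```

This is why the encoded frame later in the notebook has one indicator column fewer per categorical feature than that feature has categories.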


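Performing sentiment analysis on the text columns: each text value is replaced by its TextBlob polarity score, and non-string values (such as missing entries) are mapped to 0.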

In [8]:
print("\nPerforming sentiment analysis on text columns...")
def sentiment_analysis(text):
    if isinstance(text, str):
        return TextBlob(text).sentiment.polarity
    else:
        return 0 


Performing sentiment analysis on text columns...


In [9]:
for column in text_columns:
    df[f'{column} Sentiment'] = df[column].apply(sentiment_analysis)

In [10]:
df = df.drop(text_columns, axis=1)
print("Sentiment analysis completed.")

Sentiment analysis completed.
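The scoring loop can be sketched in a library-independent way to show one consequence of the fallback value: a missing entry and a neutral text both end up as 0. The helper `add_score_columns` and the scorer `toy_scorer` are hypothetical; `toy_scorer` is only a stand-in for `TextBlob(text).sentiment.polarity` and is not a real sentiment model.

```python
import pandas as pd

def add_score_columns(frame, columns, scorer):
    """Add '<column> Sentiment' for each column, then drop the raw text."""
    out = frame.copy()
    for name in columns:
        out[f"{name} Sentiment"] = out[name].apply(
            # Non-string values (None, NaN) fall back to 0, as in the notebook.
            lambda value: scorer(value) if isinstance(value, str) else 0
        )
    return out.drop(columns=columns)

def toy_scorer(text):
    # Hypothetical stand-in scorer: 1.0 if the text mentions "improved".
    return 1.0 if "improved" in text.lower() else 0.0

toy = pd.DataFrame({
    "Brief Summary": ["Symptoms improved", None, "Dose escalation"],
    "Enrollment": [7, 44, 46],
})
scored = add_score_columns(toy, ["Brief Summary"], toy_scorer)
scores = scored["Brief Summary Sentiment"].tolist()
print(scored)
```

If distinguishing missing text from neutral text matters downstream, one option would be to return `float("nan")` instead of 0 in the fallback branch and impute later.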


In [11]:
print("\nConverting date columns to datetime and extracting year, month, and day...")
for column in date_columns:
    df[column] = pd.to_datetime(df[column], errors='coerce')
    df[f'{column} Year'] = df[column].dt.year
    df[f'{column} Month'] = df[column].dt.month
    df[f'{column} Day'] = df[column].dt.day


Converting date columns to datetime and extracting year, month, and day...


In [12]:
df = df.drop(date_columns, axis=1)
print("Date columns converted and broken into multiple pieces.")

Date columns converted and broken into multiple pieces.
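A minimal sketch of the date handling on toy values (the strings are invented for the example): with `errors='coerce'`, anything that cannot be parsed becomes `NaT`, and the extracted year, month, and day for those rows become missing values rather than raising an error.

```python
import pandas as pd

# One valid date, one unparseable string, one missing value.
raw = pd.Series(["2015-06-02", "not a date", None])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

parts = pd.DataFrame({
    "Year": parsed.dt.year,
    "Month": parsed.dt.month,
    "Day": parsed.dt.day,
})
print(parts)
```

Rows that failed to parse therefore need imputation or removal before the extracted parts are fed to a model that does not accept missing values.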


Checking for Boolean columns and mapping them to {0, 1}

In [13]:
boolean_columns = []
for column in df.columns:
    if df[column].dtype == 'bool':
        boolean_columns.append(column)


In [14]:
print("\nConverting boolean columns to 0 and 1...")
for column in boolean_columns:
    df[column] = df[column].astype(int)
print("Boolean columns converted.")



Converting boolean columns to 0 and 1...
Boolean columns converted.
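The same conversion can be written without an explicit loop. The sketch below is an alternative formulation on toy data, not the notebook's own code: it uses `select_dtypes` to find the Boolean columns and a single `astype` call to convert them.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex_MALE": [True, False, True],   # Boolean indicator column
    "Enrollment": [7, 44, 46],         # numeric column, left untouched
})

# Find Boolean columns by dtype, then cast only those columns to int.
bool_cols = toy.select_dtypes(include="bool").columns
toy = toy.astype({name: int for name in bool_cols})
print(toy.dtypes)
```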


In [15]:
df

Unnamed: 0,Enrollment,Start Date,Completion Date,First Posted,Results First Posted,Last Update Posted,Study Recruitment Rate,Study Status_COMPLETED,Study Status_ENROLLING_BY_INVITATION,Study Status_NOT_YET_RECRUITING,...,Secondary Outcome Measures Sentiment,Other Outcome Measures Sentiment,Sponsor Sentiment,Collaborators Sentiment,Study Design Sentiment,Other IDs Sentiment,Locations Sentiment,Primary Completion Date Year,Primary Completion Date Month,Primary Completion Date Day
0,7,2014-05-12,2015-06-02,2009-05-13,NaT,2022-04-05,0.551598,1,0,0,...,0.200000,0.000,0.000000,0.0,0.0,0.0,-0.100000,2015,6,2
1,44,2014-02-20,2015-10-17,2010-04-30,2024-05-20,2024-05-20,0.443157,1,0,0,...,0.000000,0.125,-0.035714,0.0,0.0,0.0,0.018182,2015,6,27
2,46,2014-03-03,2017-05-09,2011-02-02,NaT,2017-05-16,0.044558,1,0,0,...,-0.034848,0.000,0.000000,0.0,0.0,0.0,0.166667,2017,5,9
3,56,2018-07-21,2018-11-15,2011-04-18,NaT,2018-12-13,7.279202,1,0,0,...,-0.250000,0.000,0.000000,0.0,0.0,0.0,0.000000,2018,11,1
4,851,2014-04-24,2016-11-18,2011-06-20,2018-04-23,2018-04-23,1.969008,1,0,0,...,0.060112,0.000,0.000000,0.0,0.0,0.0,0.000000,2016,11,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20671,108,2021-05-17,2022-07-18,2021-05-12,NaT,2023-04-26,7.693208,1,0,0,...,-0.272619,0.000,0.000000,0.0,0.0,0.0,0.000000,2022,7,18
20672,66,2015-08-04,2019-07-12,2015-05-25,NaT,2019-08-29,0.199434,1,0,0,...,0.126488,0.000,0.000000,0.0,0.0,0.0,-0.025253,2019,7,12
20673,261,2018-08-30,2020-09-07,2019-02-22,NaT,2021-05-11,1.074256,1,0,0,...,0.000000,0.000,0.000000,0.0,0.0,0.0,-0.040000,2020,5,12
20674,12,2014-01-14,2015-10-13,2013-11-11,2021-03-26,2021-03-26,0.044077,0,0,0,...,0.062864,0.000,0.000000,0.0,0.4,0.0,-0.015909,2015,9,1


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20676 entries, 0 to 20675
Data columns (total 50 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   Enrollment                            20676 non-null  int64         
 1   Start Date                            20676 non-null  datetime64[ns]
 2   Completion Date                       20675 non-null  datetime64[ns]
 3   First Posted                          20676 non-null  datetime64[ns]
 4   Results First Posted                  7674 non-null   datetime64[ns]
 5   Last Update Posted                    20676 non-null  datetime64[ns]
 6   Study Recruitment Rate                20676 non-null  float64       
 7   Study Status_COMPLETED                20676 non-null  int64         
 8   Study Status_ENROLLING_BY_INVITATION  20676 non-null  int64         
 9   Study Status_NOT_YET_RECRUITING       20676 non-null  int64         
 10

In [17]:
print("\nSplitting data into features and target...")
X = df.drop('Study Recruitment Rate', axis=1)
y = df['Study Recruitment Rate']


Splitting data into features and target...


In [18]:
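full_df = pd.concat([X, y], axis=1)
output_dir = 'preprocessed-dataset'
# Create the output directory first; to_csv fails if it does not exist yet.
os.makedirs(output_dir, exist_ok=True)

full_df.to_csv(os.path.join(output_dir, 'full.csv'), index=False)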
print("Full dataset saved as full.csv.")

Full dataset saved as full.csv.


In [19]:
print("\nSplitting data into training, validation, and test sets...")
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
print("Data split completed.")



Splitting data into training, validation, and test sets...
Data split completed.
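The two calls amount to a 70/15/15 split: the first holds out 30% of the rows, and the second divides that held-out part in half. The sketch below repeats the same two calls on a plain index array of 20,676 elements (the row count reported by `df.shape`) to show the resulting set sizes. The array `rows` is a stand-in for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(20676)  # stand-in for the 20,676 dataset rows

# Step 1: 70% train, 30% held out.
train, temp = train_test_split(rows, test_size=0.3, random_state=42)
# Step 2: split the held-out 30% evenly into validation and test.
val, test = train_test_split(temp, test_size=0.5, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)  # (14473, 3101, 3102)
```

The test set is one row larger than the validation set because the held-out part has an odd number of rows (6,203).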


In [20]:
output_dir = 'preprocessed-dataset'
os.makedirs(output_dir, exist_ok=True)  # no-op if the directory already exists



In [21]:
print("\nSaving preprocessed datasets...")
train_data = pd.concat([X_train, y_train], axis=1)
val_data = pd.concat([X_val, y_val], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)




Saving preprocessed datasets...


In [22]:
train_data.to_csv(os.path.join(output_dir, 'train.csv'), index=False)
val_data.to_csv(os.path.join(output_dir, 'val.csv'), index=False)
test_data.to_csv(os.path.join(output_dir, 'test.csv'), index=False)
print("Preprocessed datasets saved.")

print("\nData preprocessing completed.")

Preprocessed datasets saved.

Data preprocessing completed.
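One detail to keep in mind when the later notebooks load these files: CSV stores datetime columns such as `Start Date` as plain text, so they come back as strings unless `parse_dates` is passed to `pd.read_csv`. The sketch below demonstrates this round trip on a toy frame written to a temporary directory; the data and the temporary path are illustrative only.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({
    "Start Date": pd.to_datetime(["2014-05-12", "2014-02-20"]),
    "Enrollment": [7, 44],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")
    frame.to_csv(path, index=False)

    plain = pd.read_csv(path)                               # dates come back as text
    parsed = pd.read_csv(path, parse_dates=["Start Date"])  # dates restored

print(plain.dtypes)
print(parsed.dtypes)
```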


### Step : Checking Split Dimensions
Displays the number of rows and columns in the training, validation, and test sets to confirm the roughly 70/15/15 split.

In [23]:
train_data.shape, val_data.shape, test_data.shape

((14473, 50), (3101, 50), (3102, 50))