**PRE-ANALYSIS**

First, we explore the dataset to become familiar with its rows and columns. This analysis will show whether the available data needs to be cleaned and transformed. It will also help us come up with interesting questions that may be answered by a deeper analysis.

In [50]:
#Library imports

import pandas as pd #data transformation and manipulation.
pd.set_option("display.max_columns", None) #to display all columns of the dataframe.
import numpy as np #numerical arrays and operations
import seaborn as sns #visualizations
import matplotlib.pyplot as plt #visualizations

In [51]:
#Connect to our dataset, which consists of 3 files in txt format. Each file contains a single table. Let's load the first file: exclusivities.

df_ex = pd.read_csv("../1_Data/Data_Raw/exclusivity_raw.txt", delimiter="~") #we checked the delimiter by opening the txt.
df_ex.head()

Unnamed: 0,Appl_Type,Appl_No,Product_No,Exclusivity_Code,Exclusivity_Date
0,N,17031,1,RTO,"Jul 13, 2026"
1,N,18680,1,D-193,"Jun 28, 2027"
2,N,20263,9,NS,"Apr 14, 2026"
3,N,20825,1,M-232,"Jan 28, 2025"
4,N,20825,2,M-232,"Jan 28, 2025"


In [52]:
df_ex.tail()

Unnamed: 0,Appl_Type,Appl_No,Product_No,Exclusivity_Code,Exclusivity_Date
2411,N,215888,1,GAIN,"Apr 26, 2032"
2412,N,218275,1,GAIN,"Apr 3, 2034"
2413,N,213004,1,GAIN,"Nov 1, 2027"
2414,N,213972,1,GAIN,"Oct 25, 2034"
2415,N,217417,1,GAIN,"Mar 22, 2033"


In [53]:
#Let's retrieve more information about the txt file we have just loaded (exclusivities only).

print(f"The file has {df_ex.shape[0]} rows and {df_ex.shape[1]} columns.")
df_ex.info()

The file has 2416 rows and 5 columns.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2416 entries, 0 to 2415
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Appl_Type         2416 non-null   object
 1   Appl_No           2416 non-null   int64 
 2   Product_No        2416 non-null   int64 
 3   Exclusivity_Code  2416 non-null   object
 4   Exclusivity_Date  2416 non-null   object
dtypes: int64(2), object(3)
memory usage: 94.5+ KB


**Notes**

We can see that the file is fairly clean and complete: it has no null values and the column names are standardized (each word starts with a capital letter and spaces are replaced by underscores). However, regarding data types, the Exclusivity_Date column should be datetime instead of object. Moreover, the Appl_Type and Exclusivity_Code columns should be transformed into a more comprehensible format for the analysis. It would also be worth checking for duplicates.
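The date conversion mentioned in the notes could look roughly like the sketch below. It uses a small toy frame (named `sample` here purely for illustration) with date strings copied from the outputs above, and it assumes every value follows the same "Mon D, YYYY" pattern, which has not been verified for the full file.

```python
import pandas as pd

# Toy frame with date strings copied from the head/tail outputs above.
sample = pd.DataFrame(
    {"Exclusivity_Date": ["Jul 13, 2026", "Apr 3, 2034", "Nov 1, 2027"]}
)

# "%b %d, %Y" = abbreviated month name, day of month, four-digit year.
# errors="coerce" turns any string that does not match into NaT
# instead of raising, so unexpected formats can be inspected afterwards.
sample["Exclusivity_Date"] = pd.to_datetime(
    sample["Exclusivity_Date"], format="%b %d, %Y", errors="coerce"
)

print(sample.dtypes)
print(sample)
```

Counting the `NaT` values after the conversion (`sample["Exclusivity_Date"].isna().sum()`) is a quick way to check whether the assumed format really holds for every row.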

In [54]:
#Check for duplicated rows.

print(f"There are {df_ex.duplicated().sum()} duplicates in this dataframe.")

There are 18 duplicates in this dataframe.


In [55]:
#Let's remove duplicates

df_ex_uniq = df_ex.drop_duplicates()
print(f"There are {df_ex_uniq.duplicated().sum()} duplicates in this dataframe.")

There are 0 duplicates in this dataframe.
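Before dropping duplicates it can help to look at the repeated rows themselves. The sketch below shows one way to do that on a toy frame; the frame `toy` and its repeated row are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the first one.
toy = pd.DataFrame(
    {
        "Appl_No": [20825, 20825, 17031, 20825],
        "Product_No": [1, 2, 1, 1],
        "Exclusivity_Code": ["M-232", "M-232", "RTO", "M-232"],
    }
)

# duplicated() marks only the repeats (the first occurrence is not flagged).
n_repeats = toy.duplicated().sum()

# keep=False flags every member of a duplicated group, so the original
# row and its copies can be inspected side by side.
dupes = toy[toy.duplicated(keep=False)]

print(f"{n_repeats} repeated row(s)")
print(dupes)
```

On the real data the same two calls would be applied to `df_ex` instead of `toy`.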


In [56]:
#Now let's check the values of the columns.
pd.set_option("display.max_rows", None) #to display all rows.
display(df_ex_uniq['Appl_Type'].unique()) #Unique values for application types.
display(df_ex_uniq['Appl_Type'].value_counts()) #How many times each unique value appears.
display(df_ex_uniq['Exclusivity_Code'].unique()) #Unique values for exclusivity codes.
display(df_ex_uniq['Exclusivity_Code'].value_counts()) #How many times each unique value appears.

array(['N', 'A'], dtype=object)

Appl_Type
N    2309
A      89
Name: count, dtype: int64

array(['RTO', 'D-193', 'NS', 'M-232', 'ODE-267', 'NPP', 'ODE-164',
       'ODE-225', 'M-187', 'D-186', 'I-861', 'ODE-171', 'ODE-172',
       'ODE-380', 'ODE-156', 'ODE-231', 'ODE-167', 'ODE*', 'I-859',
       'ODE-420', 'ODE-417', 'ODE-421', 'M-287', 'ODE-469', 'ODE-131',
       'ODE-241', 'ODE-245', 'I-939', 'D-188', 'M-300', 'ODE-210',
       'ODE-315', 'I-856', 'I-867', 'I-858', 'I-848', 'M-283', 'M-271',
       'I-929', 'I-862', 'ODE-345', 'M-308', 'I-889', 'ODE-399',
       'ODE-360', 'ODE-294', 'M-14', 'M-295', 'M-280', 'ODE-367', 'I-915',
       'ODE-260', 'NCE', 'ODE-136', 'ODE-135', 'ODE-145', 'I-852',
       'ODE-328', 'ODE-407', 'I-897', 'I-855', 'ODE-182', 'ODE-183',
       'ODE-147', 'ODE-428', 'I-908', 'I-894', 'I-921', 'M-61', 'ODE-139',
       'ODE-189', 'ODE-190', 'ODE-199', 'ODE-338', 'I-879', 'ODE-157',
       'ODE-163', 'ODE-444', 'I-923', 'ODE-162', 'I-926', 'ODE-240',
       'I-934', 'ODE-472', 'ODE-169', 'ODE-297', 'ODE-296', 'ODE-148',
       'I-895', 'ODE-251',

Exclusivity_Code
NCE        421
PED        278
NPP        164
NP         151
CGT         63
ODE*        49
GAIN        43
PC          26
M-14        25
M-82        23
M-232       12
M-187       11
NS           9
M-275        8
ODE-428      7
I-939        7
ODE-405      7
ODE-360      6
D-188        6
ODE-245      6
M-295        6
ODE-131      6
ODE-241      6
ODE-495      6
I-913        6
ODE-164      6
ODE-225      6
I-920        6
M-307        6
ODE-497      6
I-925        6
ODE-439      5
I-897        5
ODE-182      5
ODE-183      5
I-908        5
ODE-444      5
ODE-272      5
ODE-501      5
ODE-238      5
M-285        5
ODE-178      5
ODE-366      5
I-935        5
ODE-268      5
ODE-434      5
ODE-356      5
ODE-130      5
ODE-252      5
ODE-435      5
I-852        5
ODE-373      5
ODE-417      5
I-859        5
W            5
ODE-210      5
M-300        5
M-296        4
ODE-296      4
ODE-165      4
ODE-503      4
I-934        4
ODE-472      4
I-863        4
ODE-297      4
I-918   

We can see there are only two values for the application type (Appl_Type): N and A. It would be helpful to replace them with more recognisable labels. The Exclusivity_Code column has 536 different values, so we should group them into broader exclusivity groups.
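One possible way to build broader groups is to keep only the leading letters of each code, so that `ODE-267` and `ODE*` both fall under `ODE`. This is a sketch under the assumption that the letter prefix is a meaningful grouping; that should be checked against the data source's documentation. The name `codes` below is a toy series with values copied from the output above.

```python
import pandas as pd

# Sample of codes copied from the unique() output above.
codes = pd.Series(
    ["RTO", "D-193", "NS", "M-232", "ODE-267", "ODE*", "I-861", "NCE"],
    name="Exclusivity_Code",
)

# Keep the run of letters at the start of each code; everything from the
# first non-letter character onwards ("-193", "*", ...) is discarded.
groups = codes.str.extract(r"^([A-Za-z]+)", expand=False)

print(groups.value_counts())
```

Codes that do not start with a letter would come out as `NaN` with this pattern, which makes them easy to spot and handle separately.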

In [57]:
#We replace N with Innovator, as it stands for NDA (New Drug Application), and A with Generic, as it stands for ANDA (Abbreviated New Drug Application). Nulls become the string 'NaN' and any other value stays the same. Note that this is applied to df_ex, not to the deduplicated df_ex_uniq.
df_ex['Appl_Type'] = df_ex['Appl_Type'].apply(lambda x: 'Innovator' if x=='N' else 'Generic' if x=='A' else 'NaN' if pd.isnull(x) else x)

In [58]:
display(df_ex.head())

Unnamed: 0,Appl_Type,Appl_No,Product_No,Exclusivity_Code,Exclusivity_Date
0,Innovator,17031,1,RTO,"Jul 13, 2026"
1,Innovator,18680,1,D-193,"Jun 28, 2027"
2,Innovator,20263,9,NS,"Apr 14, 2026"
3,Innovator,20825,1,M-232,"Jan 28, 2025"
4,Innovator,20825,2,M-232,"Jan 28, 2025"


In [59]:
df_ex['Appl_Type'].value_counts()

Appl_Type
Innovator    2327
Generic        89
Name: count, dtype: int64
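The relabelling of Appl_Type above can also be written with a dictionary and `Series.replace`, which leaves unmapped values and real nulls untouched instead of turning nulls into the string 'NaN'. The sketch below is an alternative on toy data, not the notebook's own method; `toy` and `labels` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy series with both known codes, a missing value and an unexpected code.
toy = pd.Series(["N", "A", np.nan, "X"], name="Appl_Type")

# Mapping from raw code to a more readable label.
labels = {"N": "Innovator", "A": "Generic"}

# replace() swaps only the keys found in the dictionary;
# NaN stays NaN and "X" stays "X".
relabelled = toy.replace(labels)

print(relabelled)
```

Keeping nulls as real missing values means `isna()` and `dropna()` keep working on the column later on.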