# Analysis of FDA Citations

Source: https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/inspection-references/inspection-citation

The dataset, downloaded from fda.gov, lists citations issued during FDA inspections of clinical trials, Institutional Review Boards (IRBs), and facilities that manufacture, process, pack, or hold a currently marketed FDA-regulated product. Specifically, the data come from the electronic inspection tool used to generate FDA Form 483.

In [17]:
# import packages
import gensim
import matplotlib as mpl
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
# read the Excel file into a pandas DataFrame
df = pd.read_excel("Inspection_Citation_(10-1-2008_through_7-22-2020)_0.xlsx")

In [39]:
df.head()

Unnamed: 0,Firm Name,City,State,Country/Area,Inspection End Date,Program Area,CFR/Act Number,Short Description,Long Description
0,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.20(b)(4),"Floors, walls and ceilings",The plant is not constructed in such a manner ...
1,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.20(b)(5),Safety lighting and glass,Failure to provide safety-type lighting fixtur...
2,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.35(a),Buildings/good repair,Failure to maintain buildings in repair suffic...
3,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.35(a),Cleaning and sanitizing operations,Failure to conduct cleaning and sanitizing ope...
4,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.80(a)(1),Storage,Failure to store raw materials in a manner tha...


In [58]:
# restrict the dataframe to citations from the 'Drugs' program area
## keep only observations from the Drugs program area in the United States.
drugs = df[(df['Program Area'] == "Drugs") &
           (df['Country/Area'] == "United States")].reset_index(drop=True)

## generalize the CFR/Act Number column to the CFR section,
## e.g. "21 CFR 110.20(b)(4)" becomes "21 CFR 110.20".
CFR = []
for value in drugs['CFR/Act Number']:
    # non-string values (e.g. NaN) and citations that are not 21 CFR sections become None
    match = re.search(r'21 CFR [0-9]+\.[0-9]+', value) if isinstance(value, str) else None
    CFR.append(match.group() if match else None)

drugs['CFR'] = CFR

citations = drugs['Long Description']

citations

0        Routine calibration of mechanical equipment is...
1        Drug products are not stored under appropriate...
2        Returned drug products held, stored or shipped...
3        Samples taken of in-process materials for dete...
4        Procedures describing the calibration of instr...
                               ...                        
28374    Separate or defined areas to prevent contamina...
28375    The responsibilities and procedures applicable...
28376    Written procedures are not followed for  evalu...
28377    There is no written testing program designed t...
28378    Buildings used in the manufacture, processing,...
Name: Long Description, Length: 28379, dtype: object
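
To make the CFR normalization step easier to check, here is a minimal standalone sketch of the same regular expression applied to a few sample values. The first two strings appear in the `df.head()` output above; the third string and the helper name `extract_cfr_section` are hypothetical and used only for illustration.

```python
import re

# Same pattern as in the cell above: "21 CFR <part>.<section>",
# dropping any trailing paragraph markers such as "(b)(4)".
CFR_PATTERN = re.compile(r'21 CFR [0-9]+\.[0-9]+')


def extract_cfr_section(value):
    """Return the '21 CFR part.section' prefix, or None if it is absent.

    Hypothetical helper name; it mirrors the loop in the notebook cell.
    """
    if not isinstance(value, str):
        return None  # e.g. NaN or None in the source column
    match = CFR_PATTERN.search(value)
    return match.group() if match else None


samples = [
    "21 CFR 110.20(b)(4)",   # from the df.head() output
    "21 CFR 110.35(a)",      # from the df.head() output
    "FDCA 501(a)(2)(B)",     # hypothetical non-CFR value, for illustration only
    None,                    # stands in for a missing value
]
results = [extract_cfr_section(s) for s in samples]
print(results)
# ['21 CFR 110.20', '21 CFR 110.35', None, None]
```

Values that do not match the pattern map to `None`, so rows with a missing `CFR` value can be counted or dropped explicitly before any grouping.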

## Topic Modeling of Long Descriptions

Using the long descriptions in `citations` (the Drugs subset of `df`), extract the topics that are most likely to be cited during an FDA inspection.