# Analysis of FDA Citations

Source: https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/inspection-references/inspection-citation

The dataset, downloaded from fda.gov, lists citations issued during FDA inspections of clinical trials, Institutional Review Boards (IRBs), and facilities that manufacture, process, pack, or hold a currently marketed FDA-regulated product. Specifically, the information comes from the electronic inspection tool for FDA Form 483.

In [1]:
# import packages
import gensim
import matplotlib as mpl
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
import pickle
import re
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# read dataframe into python
df = pd.read_excel("Inspection_Citation_(10-1-2008_through_7-22-2020)_0.xlsx")

In [3]:
df.head()

Unnamed: 0,Firm Name,City,State,Country/Area,Inspection End Date,Program Area,CFR/Act Number,Short Description,Long Description
0,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.20(b)(4),"Floors, walls and ceilings",The plant is not constructed in such a manner ...
1,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.20(b)(5),Safety lighting and glass,Failure to provide safety-type lighting fixtur...
2,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.35(a),Buildings/good repair,Failure to maintain buildings in repair suffic...
3,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.35(a),Cleaning and sanitizing operations,Failure to conduct cleaning and sanitizing ope...
4,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.80(a)(1),Storage,Failure to store raw materials in a manner tha...


In [4]:
# tidy dataframe: keep only citations from the 'Drugs' program area
# that were issued in the United States, and renumber the rows from 0
drugs = df[(df['Program Area'] == "Drugs") &
           (df['Country/Area'] == "United States")].reset_index(drop=True)

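## reduce the CFR/Act Number column to the CFR part and section
## (e.g. '21 CFR 110.20(b)(4)' becomes '21 CFR 110.20')
CFR = []
for value in drugs['CFR/Act Number']:
    # non-string values (e.g. NaN) and citations that are not 21 CFR map to None
    match = re.search(r'21 CFR [0-9]+\.[0-9]+', value) if isinstance(value, str) else None
    CFR.append(match.group() if match else None)
drugs['CFR'] = CFR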

# create a series of citations
citations = drugs['Long Description']
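
The extraction step above can be checked in isolation. The following is a minimal sketch using the same regular expression; the helper name `extract_cfr` and every sample string other than `21 CFR 110.20(b)(4)` are illustrative assumptions, not values taken from the dataset.

```python
import re

# Same pattern as in the cell above: title 21, then part and section
CFR_PATTERN = re.compile(r'21 CFR [0-9]+\.[0-9]+')

def extract_cfr(value):
    """Return the '21 CFR <part>.<section>' prefix, or None if it is absent.

    Hypothetical helper for illustration; it mirrors the loop in the notebook.
    """
    if not isinstance(value, str):
        return None  # e.g. NaN or None in the source column
    match = CFR_PATTERN.search(value)
    return match.group() if match else None

# Illustrative inputs (only the first appears in the df.head() output)
samples = ["21 CFR 110.20(b)(4)", "21 CFR 211.68(a)", "FDCA 501(a)(2)(A)", None]
extracted = [extract_cfr(s) for s in samples]
print(extracted)
```

Wrapping the logic in a function would also allow `drugs['CFR/Act Number'].apply(extract_cfr)` in place of the explicit loop, if that style is preferred.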

In [29]:
drugs.shape

(28379, 10)

## Topic Modeling of Long Descriptions

Fit a topic model to the long descriptions in `drugs` to identify the topics most likely to be cited during an FDA inspection.

In [16]:
# tokenize each long description into a list of word tokens
cites = list(citations)
tokens = [nltk.word_tokenize(cite) for cite in cites]

tokens[:2]

[['Routine',
  'calibration',
  'of',
  'mechanical',
  'equipment',
  'is',
  'not',
  'performed',
  'according',
  'to',
  'a',
  'written',
  'program',
  'designed',
  'to',
  'assure',
  'proper',
  'performance',
  '.'],
 ['Drug',
  'products',
  'are',
  'not',
  'stored',
  'under',
  'appropriate',
  'conditions',
  'of',
  'temperature',
  'so',
  'that',
  'their',
  'identity',
  ',',
  'strength',
  ',',
  'quality',
  ',',
  'and',
  'purity',
  'are',
  'not',
  'affected',
  '.']]

In [26]:
# Use CountVectorizer to extract unigrams, bigrams, and trigrams,
# remove English stop words,
# drop terms that appear in fewer than 25 documents,
# and drop terms that appear in more than 20% of the documents
vect = CountVectorizer(min_df = 25, max_df = 0.2, ngram_range = (1,3),
                       stop_words = 'english')

# Fit and transform
X = vect.fit_transform(cites)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns = False)

# Mapping from word IDs to words (to be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
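
To make the `min_df` and `max_df` thresholds concrete, here is a rough pure-Python sketch of document-frequency filtering on a toy corpus. It is not how `CountVectorizer` is implemented, and the function `filter_by_doc_freq`, the toy documents, and the threshold values are assumptions chosen for illustration. As in scikit-learn, an integer `min_df` is treated as an absolute document count and a float `max_df` as a proportion of documents.

```python
from collections import Counter

# Toy tokenized corpus (illustrative, not drawn from the citation data)
docs = [
    ["records", "batch", "production"],
    ["records", "calibration"],
    ["records", "batch", "cleaning"],
    ["records", "training"],
]

def filter_by_doc_freq(tokenized_docs, min_df, max_df):
    """Keep terms found in at least `min_df` documents and in at most
    a `max_df` fraction of all documents."""
    n_docs = len(tokenized_docs)
    # Count each term once per document, so the counts are document frequencies
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(
        term for term, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df
    )

kept = filter_by_doc_freq(docs, min_df=2, max_df=0.75)
print(kept)
```

In this toy corpus "records" occurs in every document and is dropped by the upper bound, while terms that occur in only one document are dropped by the lower bound; only "batch" survives. The same reasoning explains why very common and very rare terms are excluded from the vocabulary passed to the LDA model.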

In [27]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to variable 'ldamodel'

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10,
                                          id2word = id_map, passes = 25,
                                          random_state = 4)

In [28]:
ldamodel.print_topics(10)

[(0,
  '0.016*"records" + 0.013*"appropriate" + 0.012*"master" + 0.012*"master production control" + 0.012*"master production" + 0.012*"testing" + 0.011*"stability" + 0.011*"used" + 0.010*"production" + 0.010*"production control records"'),
 (1,
  '0.011*"training" + 0.011*"employees" + 0.010*"use" + 0.008*"assure" + 0.008*"applicable" + 0.008*"identity" + 0.007*"operations" + 0.007*"manufacturing" + 0.007*"good" + 0.007*"good manufacturing"'),
 (2,
  '0.020*"records" + 0.014*"include" + 0.012*"suitable" + 0.012*"instruments" + 0.011*"calibration" + 0.009*"calibration instruments" + 0.009*"laboratory" + 0.008*"established" + 0.008*"complaint" + 0.008*"cleaning"'),
 (3,
  '0.036*"records" + 0.030*"batch" + 0.025*"production" + 0.023*"production control" + 0.022*"include" + 0.019*"control records" + 0.018*"production control records" + 0.014*"batch production" + 0.014*"batch production control" + 0.013*"complete"'),
 (4,
  '0.014*"established" + 0.014*"cleaning" + 0.013*"equipment" + 0.0