# Analysis of FDA Citations

Source: https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/inspection-references/inspection-citation

The dataset pulled from fda.gov details citations made during FDA inspections conducted of clinical trials, Institutional Review Boards (IRBs), and facilities that manufacture, process, pack, or hold an FDA-regulated product that is currently marketed. Specifically, the information comes from the electronic inspection tool of FDA Form 483.

In [1]:
# import packages
import gensim
import matplotlib as mpl
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
import pickle
import re
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# read dataframe into python
df = pd.read_excel("Inspection_Citation_(10-1-2008_through_7-22-2020)_0.xlsx")

In [3]:
df.head()

Unnamed: 0,Firm Name,City,State,Country/Area,Inspection End Date,Program Area,CFR/Act Number,Short Description,Long Description
0,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.20(b)(4),"Floors, walls and ceilings",The plant is not constructed in such a manner ...
1,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.20(b)(5),Safety lighting and glass,Failure to provide safety-type lighting fixtur...
2,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.35(a),Buildings/good repair,Failure to maintain buildings in repair suffic...
3,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.35(a),Cleaning and sanitizing operations,Failure to conduct cleaning and sanitizing ope...
4,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.80(a)(1),Storage,Failure to store raw materials in a manner tha...


In [4]:
# tidy dataframe to only contain citations from the Drugs 'Program Area'
## filter for only observations from the Drug program area in USA.
drugs = df[(df['Program Area'] == "Drugs") &
           (df['Country/Area'] == "United States")].reset_index().drop('index', axis = 1)

## tidy up the CFR/Act Number column to generalize to the CFR chapter.
CFR = []
for values in drugs['CFR/Act Number']:
    try:
        x = re.search(r'21 CFR [0-9]+\.[0-9]+', values).group()
    except:
        x = None
    CFR.append(x)
drugs['CFR'] = CFR

# create a series of citations
citations = drugs['Long Description']

In [5]:
drugs.shape

(28379, 10)

## Topic Modeling of Long Descriptions

For each long description in df, extract a list of topics that are most likely to be cited during an FDA audit.

In [6]:
cites = list(citations)
tokens = []
for i in cites:
    token = nltk.word_tokenize(i)
    tokens.append(token)
    
tokens[:1]

[['Routine',
  'calibration',
  'of',
  'mechanical',
  'equipment',
  'is',
  'not',
  'performed',
  'according',
  'to',
  'a',
  'written',
  'program',
  'designed',
  'to',
  'assure',
  'proper',
  'performance',
  '.']]

In [7]:
# determine stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words[:5]
stop_words.extend(['deficient', 'used', 'purporting', 'meet'])
stop_words[:-5:-1]

['meet', 'purporting', 'used', 'deficient']

In [73]:
# Use CountVectorizer to find tokens, remove stop words,
# remove tokens that don't appear in at least 25 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df = 25, max_df = 0.2, ngram_range = (1,2),
                       stop_words = stop_words)

# Fit and transform
X = vect.fit_transform(cites)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns = False)

# Mapping from word IDs to words (TO be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())

In [74]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to variable 'ldamodel'

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 25,
                                          id2word = id_map, passes = 25,
                                          random_state = 4)

In [75]:
ldamodel.print_topics(30)

[(0,
  '0.051*"purity" + 0.051*"strength" + 0.050*"strength quality" + 0.047*"quality purity" + 0.047*"identity strength" + 0.046*"identity" + 0.034*"production" + 0.033*"process" + 0.032*"written procedures" + 0.032*"assure"'),
 (1,
  '0.057*"release" + 0.037*"include" + 0.034*"laboratory" + 0.032*"testing" + 0.029*"conformance" + 0.029*"determination" + 0.029*"distribution" + 0.028*"appropriate" + 0.028*"prior" + 0.028*"prior release"'),
 (2,
  '0.025*"strength" + 0.024*"quality purity" + 0.024*"purity" + 0.024*"identity" + 0.024*"identity strength" + 0.024*"strength quality" + 0.024*"safety" + 0.024*"equipment" + 0.023*"alter" + 0.023*"purity drug"'),
 (3,
  '0.030*"cleaning" + 0.029*"equipment" + 0.028*"written procedures" + 0.025*"maintenance" + 0.024*"containers" + 0.024*"sufficient" + 0.024*"sufficient detail" + 0.024*"detail" + 0.023*"cleaning maintenance" + 0.022*"product containers"'),
 (4,
  '0.090*"unit" + 0.083*"control unit" + 0.083*"quality control" + 0.051*"applicable q

In [40]:
new_doc = ['Aseptic processing areas are deficient regarding the system for monitoring environmental conditions.']

X = vect.transform(new_doc)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns = False)
topic_dis = list(ldamodel[corpus])[0]

topic_dis

[(19, 0.94062454)]

### Topic Modeling Evaluation

__Log Perplexity__  
n_topics = 10 --> ~ -1400  
n_topics = 20 --> -1577.34  
n_topics = 25 --> -5.32  
n_topics = 30 --> -5.25

Numbers closer to zero may be better.

In [76]:
ldamodel.log_perplexity(corpus)

-5.322630179479673

__Top Topics__

In [79]:
ldamodel.top_topics(corpus, topn = 3)

[([(0.09003838, 'unit'),
   (0.0829901, 'control unit'),
   (0.08292807, 'quality control')],
  -0.016441441580521896),
 ([(0.05072036, 'purity'),
   (0.05055937, 'strength'),
   (0.050143704, 'strength quality')],
  -0.05528420379113597),
 ([(0.046131957, 'holding'),
   (0.04597108, 'packing'),
   (0.045562748, 'processing packing')],
  -0.07846425823277849),
 ([(0.0249014, 'strength'),
   (0.024311759, 'quality purity'),
   (0.024303172, 'purity')],
  -0.10836784928111425),
 ([(0.059587367, 'aseptic'),
   (0.04569817, 'areas'),
   (0.04263366, 'conditions')],
  -0.3539871627137067),
 ([(0.057410963, 'release'),
   (0.03734178, 'include'),
   (0.034061547, 'laboratory')],
  -0.40907257923842427),
 ([(0.029809246, 'perform'), (0.021748058, 'engaged'), (0.020135572, 'lack')],
  -0.6999164907174339),
 ([(0.03635957, 'contamination'),
   (0.035322893, 'prevent'),
   (0.030174054, 'designed')],
  -0.7153028133566353),
 ([(0.07726238, 'failure'), (0.07352669, 'batch'), (0.03889419, 'whether

In [80]:
# assign each citation to the most probable topic
t2 = []
for line in cites:
    t = [line]
    T = vect.transform(t)
    corpus = gensim.matutils.Sparse2Corpus(T, documents_columns = False)
    T_dis = list(ldamodel[corpus])[0]
    t2.append(max(T_dis))

In [81]:
# join the predicted topics to the drugs df.
s_t2 = pd.Series(t2)

drugs2 = drugs

drugs2['topic_number'] = [n for (n, p) in s_t2.values]

drugs2['topic_probability'] = [p for (n, p) in s_t2.values]

# rename 'Long Description' column to a more readable format
drugs3 = drugs2.rename(columns={'Long Description': 'long_description'})

drugs3.head()

Unnamed: 0,Firm Name,City,State,Country/Area,Inspection End Date,Program Area,CFR/Act Number,Short Description,long_description,CFR,topic_number,topic_probability
0,AmerisourceBergen Drug Corporation,Des Moines,WA,United States,2008-10-07,Drugs,21 CFR 211.68(a),Calibration/Inspection/Checking not done,Routine calibration of mechanical equipment is...,21 CFR 211.68,12,0.958261
1,AmerisourceBergen Drug Corporation,Des Moines,WA,United States,2008-10-07,Drugs,21 CFR 211.142(b),Storage under appropriate conditions,Drug products are not stored under appropriate...,21 CFR 211.142,20,0.557778
2,AmerisourceBergen Drug Corporation,Des Moines,WA,United States,2008-10-07,Drugs,21 CFR 211.204,Returned drug products with doubt cast as to s...,"Returned drug products held, stored or shipped...",21 CFR 211.204,20,0.082177
3,"IPR Pharmaceuticals, Inc.",Canovanas,PR,United States,2008-10-08,Drugs,21 CFR 211.160(b)(2),"In-process samples representative, identified ...",Samples taken of in-process materials for dete...,21 CFR 211.160,14,0.926153
4,"IPR Pharmaceuticals, Inc.",Canovanas,PR,United States,2008-10-08,Drugs,21 CFR 211.160(b)(4),Establishment of calibration procedures,Procedures describing the calibration of instr...,21 CFR 211.160,12,0.741772


In [85]:
drugs3.groupby('topic_number').count()['long_description']

topic_number
0      800
1      689
2     1037
3      713
4     1727
5      553
6      264
7     1357
8     1101
9      856
10     683
11     562
12    1220
13     814
14     927
15    1301
16    1449
17     802
18    1385
19    1310
20    1695
21     856
22    1508
23    2758
24    2012
Name: long_description, dtype: int64

In the citations df, there are a total of 3,245 unique citations in the ~ 30,000 total citations.

In [102]:
drugs3.head()

all_topics = list(drugs3.long_description)
unique_topics = []
for x in all_topics:
    if x not in unique_topics:
        unique_topics.append(x)
        
len(unique_topics)

3245

In [97]:
# get list of all unique long descriptions for each topic

## topic 2
topic2 = list(drugs3[drugs3['topic_number'] == 2].long_description)
unique_topic2 = []
for x in topic2:
    if x not in unique_topic2:
        unique_topic2.append(x)
#print("\nTopic 2")
#print(unique_topic2)

## topic 3
topic3 = list(drugs3[drugs3['topic_number'] == 3].long_description)
unique_topic3 = []
for x in topic3:
    if x not in unique_topic3:
        unique_topic3.append(x)
#print("\nTopic 3")
#print(unique_topic3)

## topic 15
topic15 = list(drugs3[drugs3['topic_number'] == 15].long_description)
unique_topic15 = []
for x in topic15:
    if x not in unique_topic15:
        unique_topic15.append(x)
#print("\nTopic 15")
#print(unique_topic15)

## topic 16
topic16 = list(drugs3[drugs3['topic_number'] == 16].long_description)
unique_topic16 = []
for x in topic16:
    if x not in unique_topic16:
        unique_topic16.append(x)
#print("\nTopic 16")
#print(unique_topic16)

## topic 22
topic22 = list(drugs3[drugs3['topic_number'] == 22].long_description)
unique_topic22 = []
for x in topic22:
    if x not in unique_topic22:
        unique_topic22.append(x)
#print("\nTopic 22")
#print(unique_topic22)

In [94]:
topics = ['Purity and Strength', 'QC Testing Prior to Release']

# CFR Predicting

When the FDA provides a facility with a list of observations on Form 483, they do _not_ link the associated chapter from 21 CFR 211. However, the citations DataFrame pulled from fda.gov _does_ contain a reference to the 21 CFR chapter. Therefore, it should be possible to train a model to predict the 21 CFR 211 chapter based on the citation given.