# Analysis of FDA Citations

Source: https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/inspection-references/inspection-citation

The dataset, downloaded from fda.gov, lists citations issued during FDA inspections of clinical trials, Institutional Review Boards (IRBs), and facilities that manufacture, process, pack, or hold a currently marketed FDA-regulated product. Specifically, the records come from the electronic inspection tool used for FDA Form 483.

In [1]:
# import packages
import gensim
import matplotlib as mpl
import numpy as np
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
import pickle
import re
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# read dataframe into python
df = pd.read_excel("Inspection_Citation_(10-1-2008_through_7-22-2020)_0.xlsx")

In [3]:
df.head()

Unnamed: 0,Firm Name,City,State,Country/Area,Inspection End Date,Program Area,CFR/Act Number,Short Description,Long Description
0,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.20(b)(4),"Floors, walls and ceilings",The plant is not constructed in such a manner ...
1,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.20(b)(5),Safety lighting and glass,Failure to provide safety-type lighting fixtur...
2,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.35(a),Buildings/good repair,Failure to maintain buildings in repair suffic...
3,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.35(a),Cleaning and sanitizing operations,Failure to conduct cleaning and sanitizing ope...
4,"A & M Bakery, Inc.",Clarksburg,WV,United States,2008-10-01,Foods,21 CFR 110.80(a)(1),Storage,Failure to store raw materials in a manner tha...


In [4]:
# restrict the dataframe to citations from the Drugs 'Program Area'
## keep only observations from the Drugs program area in the United States.
drugs = df[(df['Program Area'] == "Drugs") &
           (df['Country/Area'] == "United States")].reset_index(drop = True)

## generalize the CFR/Act Number column to the CFR section,
## e.g. "21 CFR 211.68(a)" becomes "21 CFR 211.68".
CFR = []
for values in drugs['CFR/Act Number']:
    try:
        x = re.search(r'21 CFR [0-9]+\.[0-9]+', values).group()
    except (AttributeError, TypeError):
        # AttributeError: no match; TypeError: value is not a string (e.g. NaN)
        x = None
    CFR.append(x)
drugs['CFR'] = CFR

# create a series of citations
citations = drugs['Long Description']
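
The section extraction above can be checked in isolation. The following is a minimal sketch using only the standard library; the helper name `extract_cfr_section` and the non-CFR sample string are hypothetical and do not appear in the notebook.

```python
import re

# Same pattern as in the cell above: "21 CFR", then part.section.
CFR_SECTION = re.compile(r'21 CFR [0-9]+\.[0-9]+')

def extract_cfr_section(value):
    """Return the '21 CFR part.section' prefix of a citation, or None."""
    if not isinstance(value, str):   # e.g. NaN for a missing cell
        return None
    match = CFR_SECTION.search(value)
    return match.group() if match else None

print(extract_cfr_section("21 CFR 211.160(b)(2)"))  # 21 CFR 211.160
print(extract_cfr_section("21 CFR 211.204"))        # 21 CFR 211.204
print(extract_cfr_section("FDCA 501(a)(2)(B)"))     # None (hypothetical non-CFR value)
```

In pandas, `drugs['CFR/Act Number'].str.extract(r'(21 CFR [0-9]+\.[0-9]+)', expand = False)` should produce a similar column without the explicit loop, with `NaN` rather than `None` for non-matching rows.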

In [5]:
drugs.shape

(28379, 10)

## Topic Modeling of Long Descriptions

Fit a topic model to the long descriptions in `drugs` to extract the topics most likely to be cited during an FDA drug inspection.

In [6]:
cites = list(citations)
tokens = []
for i in cites:
    token = nltk.word_tokenize(i)
    tokens.append(token)
    
tokens[:1]

[['Routine',
  'calibration',
  'of',
  'mechanical',
  'equipment',
  'is',
  'not',
  'performed',
  'according',
  'to',
  'a',
  'written',
  'program',
  'designed',
  'to',
  'assure',
  'proper',
  'performance',
  '.']]

In [7]:
# determine stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words[:5]
stop_words.extend(['deficient', 'used', 'purporting', 'meet'])
stop_words[:-5:-1]

['meet', 'purporting', 'used', 'deficient']

In [8]:
# Use CountVectorizer to find tokens, remove stop words,
# remove tokens that don't appear in at least 25 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df = 25, max_df = 0.2, ngram_range = (1,2),
                       stop_words = stop_words)

# Fit and transform
X = vect.fit_transform(cites)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns = False)

# Mapping from word IDs to words (to be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
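
To make the `min_df` and `max_df` thresholds concrete, here is a rough pure-Python sketch of document-frequency pruning. It assumes plain whitespace tokenization and unigrams only, which is simpler than what `CountVectorizer` does; the function name `prune_vocabulary` and the toy documents are made up for illustration.

```python
from collections import Counter

def prune_vocabulary(docs, min_df, max_df):
    """Keep terms found in at least min_df documents and in at most
    a max_df fraction of all documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # a set counts each term once per document
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items()
                  if df >= min_df and df <= max_df * n_docs)

toy_docs = [
    "procedures not established",
    "procedures not followed",
    "equipment not cleaned",
    "equipment not calibrated",
    "records not established",
]

# "not" is in every document (too common); "followed", "cleaned",
# "calibrated" and "records" appear only once (too rare).
print(prune_vocabulary(toy_docs, min_df=2, max_df=0.5))
# ['equipment', 'established', 'procedures']
```

With `min_df = 25` and `max_df = 0.2` on the 28,379 drug citations, the same idea drops terms found in fewer than 25 citations or in more than 20% of them.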

In [9]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to variable 'ldamodel'

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10,
                                          id2word = id_map, passes = 25,
                                          random_state = 4)

In [10]:
ldamodel.print_topics(10)

[(0,
  '0.043*"quality control" + 0.041*"control unit" + 0.041*"unit" + 0.029*"applicable" + 0.024*"applicable quality" + 0.024*"procedures applicable" + 0.024*"responsibilities procedures" + 0.024*"responsibilities" + 0.023*"purity" + 0.023*"quality purity"'),
 (1,
  '0.028*"established" + 0.017*"may" + 0.016*"material" + 0.016*"process" + 0.016*"procedures established" + 0.016*"manufacturing" + 0.015*"control procedures" + 0.015*"process material" + 0.014*"processes" + 0.014*"responsible"'),
 (2,
  '0.027*"appropriate" + 0.019*"followed" + 0.018*"appropriate intervals" + 0.018*"intervals" + 0.018*"process" + 0.017*"component" + 0.016*"process control" + 0.016*"procedures describing" + 0.016*"describing" + 0.016*"production process"'),
 (3,
  '0.033*"equipment" + 0.022*"processing" + 0.021*"holding" + 0.021*"packing" + 0.020*"processing packing" + 0.020*"manufacture" + 0.020*"holding drug" + 0.020*"packing holding" + 0.020*"cleaning" + 0.019*"manufacture processing"'),
 (4,
  '0.027*"

In [11]:
new_doc = ['Aseptic processing areas are deficient regarding the system for monitoring environmental conditions.']

X = vect.transform(new_doc)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns = False)
topic_dis = list(ldamodel[corpus])[0]

topic_dis

[(7, 0.9437485)]

In [12]:
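# per-word likelihood bound; note that `corpus` here is the one-document
# corpus built from new_doc in the previous cell, not the full set of citations
ldamodel.log_perplexity(corpus)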


-1407.9982700665792
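
gensim documents the value returned by `log_perplexity` as a per-word likelihood bound, with perplexity defined as 2^(-bound). The sketch below converts a bound into a perplexity with the standard library; the helper name `perplexity_from_bound` is hypothetical. For the value above the result exceeds the range of a 64-bit float, so the helper returns infinity.

```python
import math

def perplexity_from_bound(per_word_bound):
    """Convert a per-word log2 likelihood bound into perplexity, 2 ** (-bound)."""
    try:
        return 2.0 ** (-per_word_bound)
    except OverflowError:
        # the result does not fit in a 64-bit float
        return math.inf

print(perplexity_from_bound(-3.0))                 # 8.0
print(perplexity_from_bound(-1407.9982700665792))  # inf
```

A bound this extreme most likely reflects the single-document evaluation corpus rather than the quality of the model; evaluating on a held-out set of citations would give a more meaningful figure.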

In [13]:
#import pyLDAvis.gensim
#from gensim.corpora.dictionary import Dictionary

#id_map = dict((v, k) for k, v in vect.vocabulary_.items())

#pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary) # needs to have a defined dictionary

In [20]:
#topic = []
#for line in cites:
    #X = vect.transform(line)
    #corpus = gensim.matutils.Sparse2Corpus(X, documents_columns = False)
    #topic_dis = list(ldamodel[corpus])[0]
    #topic.append(topic_dis)

#topic[:5]

topic = [cites[0]] # vect.transform expects an iterable of documents, so each citation is wrapped in a list.
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]
print(topic)
print(new_doc)
X = vect.transform(topic)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns = False)
topic_dis = list(ldamodel[corpus])[0]


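# assign each citation a topic
# NOTE: max() on (topic_id, probability) tuples compares topic_id first, so this
# keeps the highest-numbered topic returned for each citation, which is not
# necessarily the most probable one. To select by probability instead, use
# max(T_dis, key = lambda pair: pair[1]).
t2 = []
for line in cites:
    t = [line]
    T = vect.transform(t)
    corpus = gensim.matutils.Sparse2Corpus(T, documents_columns = False)
    T_dis = list(ldamodel[corpus])[0]
    t2.append(max(T_dis))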

['Routine calibration of mechanical equipment is not performed according to a written program designed to assure proper performance.']
["\n\nIt's my understanding that the freezing will start to occur because of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge Krumins\n-- "]


28379
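
When picking one topic per citation, note that Python's `max` compares tuples element by element, so on `(topic_id, probability)` pairs it favors the largest topic id. A minimal sketch of selecting by probability follows; the distribution is made up for illustration and the helper name `most_probable_topic` is hypothetical.

```python
def most_probable_topic(topic_dist):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_dist, key=lambda pair: pair[1])

# made-up distribution in the shape returned by ldamodel[corpus]
example_dist = [(2, 0.61), (5, 0.07), (9, 0.32)]

print(max(example_dist))                  # (9, 0.32)  largest topic id
print(most_probable_topic(example_dist))  # (2, 0.61)  largest probability
```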

In [58]:
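# join the predicted topics to the drugs df.
s_t2 = pd.Series(t2)
drugs['topic'] = s_t2.values

# convert each (topic_id, probability) tuple to a string and split it on the comma;
# the parentheses remain in the resulting strings
drugs['topic'] = drugs['topic'].astype(str)
drugs[['topic_number', 'topic_probability']] = drugs.topic.str.split(',', expand = True)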

drugs.head()

Unnamed: 0,Firm Name,City,State,Country/Area,Inspection End Date,Program Area,CFR/Act Number,Short Description,Long Description,CFR,Routine calibration of mechanical equipment is not performed according to a written program designed to assure proper performance.,topic,topic_number,topic_probability
0,AmerisourceBergen Drug Corporation,Des Moines,WA,United States,2008-10-07,Drugs,21 CFR 211.68(a),Calibration/Inspection/Checking not done,Routine calibration of mechanical equipment is...,21 CFR 211.68,,"(3, 0.96086705)",(3,0.96086705)
1,AmerisourceBergen Drug Corporation,Des Moines,WA,United States,2008-10-07,Drugs,21 CFR 211.142(b),Storage under appropriate conditions,Drug products are not stored under appropriate...,21 CFR 211.142,,"(0, 0.9499952)",(0,0.9499952)
2,AmerisourceBergen Drug Corporation,Des Moines,WA,United States,2008-10-07,Drugs,21 CFR 211.204,Returned drug products with doubt cast as to s...,"Returned drug products held, stored or shipped...",21 CFR 211.204,,"(9, 0.32356438)",(9,0.32356438)
3,"IPR Pharmaceuticals, Inc.",Canovanas,PR,United States,2008-10-08,Drugs,21 CFR 211.160(b)(2),"In-process samples representative, identified ...",Samples taken of in-process materials for dete...,21 CFR 211.160,,"(9, 0.32835144)",(9,0.32835144)
4,"IPR Pharmaceuticals, Inc.",Canovanas,PR,United States,2008-10-08,Drugs,21 CFR 211.160(b)(4),Establishment of calibration procedures,Procedures describing the calibration of instr...,21 CFR 211.160,,"(7, 0.6549899)",(7,0.6549899)
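
An alternative to splitting the stringified tuples is to unpack them directly, which keeps the topic number as an integer and the probability as a float. This is a minimal sketch using the first three `topic` values from the table above; it is not the approach the notebook takes.

```python
# (topic_id, probability) pairs copied from the first three rows above
pairs = [(3, 0.96086705), (0, 0.9499952), (9, 0.32356438)]

# zip(*pairs) transposes the list of pairs into two columns
topic_numbers, topic_probabilities = (list(col) for col in zip(*pairs))

print(topic_numbers)        # [3, 0, 9]
print(topic_probabilities)  # [0.96086705, 0.9499952, 0.32356438]
```

In the notebook, the same unpacking applied to `t2` would yield two lists that could be assigned to `drugs['topic_number']` and `drugs['topic_probability']`.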
