# FDA : FOOD RECALL ENFORCEMENT REPORTS

**FDA considers a recall to be a firm's removal or correction of a marketed product that the FDA considers to be in violation of the laws it administers and against which the agency would initiate legal action.  Separate from determining whether a firm’s action meets the definition of a recall, FDA also classifies a particular product recall to indicate the relative degree of health hazard (class I, II, or III) presented by the product being recalled.  Recalls are categorized in the Enforcement Report as either class I, II, or III or "not yet classified."**<br><br>**All recalls monitored by FDA are included in the Enforcement Report once they are classified and may be listed prior to classification when FDA determines the firm’s removal or correction of a marketed product(s) meets the definition of a recall.  Once FDA completes the hazard assessment, the Enforcement Report entry will be updated with the recall classification.**<br><br>**Instructions for navigating the report and definitions of the report contents are found on the Enforcement Report Navigation and Definitions page, https://www.fda.gov/Safety/Recalls/EnforcementReports/ucm181313.htm.**

|Label|Definition
|---|---|
|Recalling Firm |The firm that initiates a recall
|Classification |Numerical designation (I, II, or III) that is assigned by FDA to a particular product recall that indicates the relative degree of health hazard. For recalls pending classification, the entry will display as “Not Yet Classified”
|Class I|Class I is a situation in which there is a reasonable probability that the use of, or exposure to, a violative product will cause serious adverse health consequences or death
|Class II|Class II is a situation in which use of, or exposure to, a violative product may cause temporary or medically reversible adverse health consequences or where the probability of serious adverse health consequences is remote
|Class III|Class III is a situation in which use of, or exposure to, a violative product is not likely to cause adverse health consequences
|Status|Shows the progress of a recall
|On-Going|A recall which is currently in progress
|Completed|A recall which has reached the point at which the firm has actually retrieved and impounded all outstanding product that could reasonably be expected to be recovered, or has completed all product corrections
|Terminated|A recall where FDA has determined that all reasonable efforts have been made to remove or correct the violative product in accordance with the recall strategy, and proper disposition has been made according to the degree of hazard
|Distribution Pattern|General area of initial distribution such as states, countries, or territories. Note that subsequent distribution by the consignees to other parties may not be included
|Product Description|Brief description of the product
|Code Information|A list of all lot and/or serial numbers, product numbers, expiration dates, sell or use by dates, etc., which appear on the product or its labeling
|Reason for Recall|Information describing how the product is defective
|Product Quantity|The amount of product subject to recall
|Voluntary/Mandated|Designates that a recall was initiated voluntarily by a firm on its own volition or after being requested to recall by FDA.  “Mandatory” designates that a recall was initiated under a mandatory (statuatory) recall authority, a court order, or FDA order.
|Recall Initiation Date|The date that the firm first began notifying the public or their consignees of the recall
|Initial Firm Notification of Consignee or Public|The method(s) by which the firm initially notified the public or their consignees of a recall
|Recall Number|An alphanumeric designation assigned by FDA to a specific, classified recalled product (used for tracking purposes)
|Event ID|A numerical designation assigned by FDA to a specific recall event (used for tracking purposes)
|Center Classification Date|The date that FDA classified the recalled products as Class I, II, or III
|Date Terminated|The date that FDA terminated the recall

**Food Recall Enforcement Reports [/food/enforcement]**<br><br>This endpoint's data may be downloaded in **zipped JSON files.** Records are represented in the same format as API calls to this endpoint. Each update to the data in this endpoint could change old records. You need to download all the files to ensure you have a complete and up-to-date dataset, not just the newest files. For more information about openFDA downloads, see the API basics.

## 1. We import the JSON file and convert it to a pandas dataframe

In [33]:
import pandas as pd
import numpy as np

In [34]:
import json
from pandas.io.json import json_normalize    
with open('food-enforcement-0001-of-0001.json') as data_file:    
    data = json.load(data_file)  
df = json_normalize(data, 'results')

In [35]:
display(df.head(5))

Unnamed: 0,address_1,address_2,center_classification_date,city,classification,code_info,country,distribution_pattern,event_id,initial_firm_notification,...,product_type,reason_for_recall,recall_initiation_date,recall_number,recalling_firm,report_date,state,status,termination_date,voluntary_mandated
0,748 S Alameda St,,20120926,Los Angeles,Class II,not available.,United States,California,63150,E-Mail,...,Food,M & K Trading is recalling Korean Molluscan Sh...,20120921,F-2396-2012,M & K Trading Inc,20121003,CA,Terminated,20120926,Voluntary: Firm Initiated
1,4401 Foxdale St,,20120924,Irwindale,Class I,"UPC 7774523746, Use by dates 9/8/2012 or earli...",United States,Nationwide to following US States and Canada: ...,63062,Press Release,...,Food,"Firm is voluntarily recalling, out of an abund...",20120831,F-2382-2012,Ready Pac Foods Inc,20121003,CA,Terminated,20121025,Voluntary: Firm Initiated
2,2315 Moore Ave,,20120927,Fullerton,Class II,"UPC 0-30871-33001-2, Item # 0291710.",United States,Nationwide and Canada.,62991,Letter,...,Food,The firm recalled due to a potential non-safet...,20120712,F-2438-2012,Pulmuone Wildwood Inc,20121003,CA,Terminated,20121217,Voluntary: Firm Initiated
3,2315 Moore Ave,,20120927,Fullerton,Class II,"UPC 0-52334-11659-9, Item # 0291661.",United States,Nationwide and Canada.,62991,Letter,...,Food,The firm recalled due to a potential non-safet...,20120712,F-2436-2012,Pulmuone Wildwood Inc,20121003,CA,Terminated,20121217,Voluntary: Firm Initiated
4,1720 Locust Grove Road,,20120921,Manheim,Class II,"No codes; all product ""Purchase by date shown ...",United States,Product was distributed to specific wholesale ...,62465,Telephone,...,Food,FDA samples of product tested positive for Fum...,20120417,F-2374-2012,Haldeman Mills,20121003,PA,Terminated,20130716,Voluntary: Firm Initiated


## 2. Missing Values 

In [36]:
null_cols = df.isnull().sum()
print(null_cols)

address_1                         0
address_2                         0
center_classification_date        7
city                              0
classification                    0
code_info                         0
country                           0
distribution_pattern              0
event_id                          0
initial_firm_notification         0
more_code_info                16690
openfda                           0
postal_code                       0
product_description               0
product_quantity                  0
product_type                      0
reason_for_recall                 0
recall_initiation_date            0
recall_number                     0
recalling_firm                    0
report_date                       0
state                             0
status                            0
termination_date               1772
voluntary_mandated                0
dtype: int64


In [37]:
#Some problems arised identifying nulls in address_2 as isnull() does not identify the field as null
print(df['address_2'][8])
print(df['address_1'][8])
sum(df['address_2'] =="")
sum(df['address_2'] !="")

Beloit, WI 53511 USA
3400 Millington RD


797

**As we have postal_code and State Information maybe the address_1 and address_2 columns can be dropped as
they are not important for further analysis**


In [38]:
drop_cols = list(df[['address_1','address_2']])
df= df.drop(drop_cols,axis=1)

In [39]:
print(list(df))

['center_classification_date', 'city', 'classification', 'code_info', 'country', 'distribution_pattern', 'event_id', 'initial_firm_notification', 'more_code_info', 'openfda', 'postal_code', 'product_description', 'product_quantity', 'product_type', 'reason_for_recall', 'recall_initiation_date', 'recall_number', 'recalling_firm', 'report_date', 'state', 'status', 'termination_date', 'voluntary_mandated']


**In addition the column more_code_info does not have any value for nearly all records in our dataset. 
We can also drop this column**

In [40]:
df = df.drop('more_code_info',axis = 1)

In [41]:
print(list(df))

['center_classification_date', 'city', 'classification', 'code_info', 'country', 'distribution_pattern', 'event_id', 'initial_firm_notification', 'openfda', 'postal_code', 'product_description', 'product_quantity', 'product_type', 'reason_for_recall', 'recall_initiation_date', 'recall_number', 'recalling_firm', 'report_date', 'state', 'status', 'termination_date', 'voluntary_mandated']


***Finally, we have the openfda column filled dictionary type {} but no content. This does not make a lot of sense for analysis purposes. We can also drop this column***

In [42]:
from collections import Counter

In [43]:
Counter (df['openfda'] == {})

Counter({True: 16691})

In [44]:
df = df.drop('openfda',axis = 1)
print(list(df))

['center_classification_date', 'city', 'classification', 'code_info', 'country', 'distribution_pattern', 'event_id', 'initial_firm_notification', 'postal_code', 'product_description', 'product_quantity', 'product_type', 'reason_for_recall', 'recall_initiation_date', 'recall_number', 'recalling_firm', 'report_date', 'state', 'status', 'termination_date', 'voluntary_mandated']


***We have 1772 nulls in termination_date and 7 in center_classification_date. Checking for errors***: <br>
1. We would expect an ongoing status for those records without termination_date. Let's check that it is met<br>
2. We would also expect that the 7 records without center_classification_date should have a ***'Not Yet Classified'*** classification value 


In [45]:
# Select status where the termination_date is missing 
df['status'][df['termination_date'].isnull()].value_counts()

Ongoing       1594
Completed      138
Terminated      40
Name: status, dtype: int64

**We have 1594 cases that meet what we expected so we can conclude these are not errors and we don't need to fill in the null records. On the other hand, we get 178 cases that don't meet our hypothesis (138 cases have a completed status and 40 are considered terminated). Let's check what's happening!!**


In [46]:
# Let's see some of these records 
completed_inconsistencies= df[df['termination_date'].isnull() & (df['status']=='Completed')]
print(len(completed_inconsistencies))
display(completed_inconsistencies.head())

138


Unnamed: 0,center_classification_date,city,classification,code_info,country,distribution_pattern,event_id,initial_firm_notification,postal_code,product_description,...,product_type,reason_for_recall,recall_initiation_date,recall_number,recalling_firm,report_date,state,status,termination_date,voluntary_mandated
70,20160323,Solon,Class II,Production codes and Best Before dates: 1....,United States,Nationwide,73500,Telephone,44139-2205,Stouffer's vegetable lasagna. Keep Frozen. So...,...,Food,Nestle is recalling a limited number of DiGior...,20160310,F-0792-2016,Nestle USA,20160330,OH,Completed,,Voluntary: Firm Initiated
71,20160323,Solon,Class II,Production codes: 5348587812 and 5349587812. ...,United States,Nationwide,73500,Telephone,44139-2205,Lean Cuisine Marketplace Mushroom Mezzaluna Ra...,...,Food,Nestle is recalling a limited number of DiGior...,20160310,F-0789-2016,Nestle USA,20160330,OH,Completed,,Voluntary: Firm Initiated
73,20160323,Solon,Class II,"Batch code: 6004525932, 6005525931, 602052593...",United States,Nationwide,73500,Telephone,44139-2205,DiGiorno Rising Crust Spinach and Mushroom Piz...,...,Food,Nestle is recalling a limited number of DiGior...,20160310,F-0785-2016,Nestle USA,20160330,OH,Completed,,Voluntary: Firm Initiated
943,20160809,San Antonio,Class II,BETTER BY 12/27/17,United States,Products was distributed to the southern midwe...,74374,Telephone,78204-1402,"Williams Chipotle Chili Seasoning, Net Wt. 1 1...",...,Food,Chili and taco seasoning products contain an i...,20160608,F-1918-2016,"CH Guenther & Sons, Inc",20160817,TX,Completed,,Voluntary: Firm Initiated
1084,20160809,San Antonio,Class II,BETTER BY 01/27/18; BETTER BY 04/11/18,United States,Products was distributed to the southern midwe...,74374,Telephone,78204-1402,"Williams Chipotle Taco Seasoning, Net Wt. 1 1/...",...,Food,Chili and taco seasoning products contain an i...,20160608,F-1919-2016,"CH Guenther & Sons, Inc",20160817,TX,Completed,,Voluntary: Firm Initiated


In [47]:
terminated_inconsistencies= df[df['termination_date'].isnull() & (df['status']=='Terminated')]
print(len(terminated_inconsistencies))
display(terminated_inconsistencies.head())

40


Unnamed: 0,center_classification_date,city,classification,code_info,country,distribution_pattern,event_id,initial_firm_notification,postal_code,product_description,...,product_type,reason_for_recall,recall_initiation_date,recall_number,recalling_firm,report_date,state,status,termination_date,voluntary_mandated
2871,20170111,Charleroi,Class II,Double Takes: Best By: 10/11/17 10/12/17 10/13...,United States,CA CT GA IL MS OK PA SC VA WA,75911,Telephone,15022-1060,Cardboard sleeve: Double Takes Macaroni & Chee...,...,Food,Fourth Street Barbecue Inc./ Packing Division ...,20161209,F-1234-2017,"Fourth Street Barbeque, Inc.",20170118,PA,Terminated,,Voluntary: Firm Initiated
3020,20160316,Chicago,Class III,Production date: 08/19/2015; Best used by date...,United States,All product was distributed in the state of Fl...,72717,"Two or more of the following: Email, Fax, Lett...",60642-4205,Swai Fillets ( 7-9 oz.) packaged in a 15 lb. b...,...,Food,Nitrofuran (SCA) was found in the product duri...,20151104,F-0579-2016,Restaurant Depot/Jetro,20160323,IL,Terminated,,Voluntary: Firm Initiated
3176,20170217,Sioux City,Class I,Exp 11/28/2017,United States,Product distributed to grocery stores and groc...,76154,"Two or more of the following: Email, Fax, Lett...",51105-2444,Palmer's Candies Chocolatey Flavored NP Heart ...,...,Food,Product contains an ingredient that was recall...,20170109,F-1511-2017,"Palmer and Company, dba Palmer Candy Co",20170301,IA,Terminated,,Voluntary: Firm Initiated
3245,20170217,Sioux City,Class I,Exp 7/18/2017,United States,Product distributed to grocery stores and groc...,76154,"Two or more of the following: Email, Fax, Lett...",51105-2444,Palmer's Candies Game Day Party Bowl NET WT 16...,...,Food,Product contains an ingredient that was recall...,20170109,F-1500-2017,"Palmer and Company, dba Palmer Candy Co",20170301,IA,Terminated,,Voluntary: Firm Initiated
3248,20170217,Sioux City,Class I,"Exp 8/30/2017, 9/7/2017",United States,Product distributed to grocery stores and groc...,76154,"Two or more of the following: Email, Fax, Lett...",51105-2444,Palmer's Candies Swirled Pretzels NET WT 5 OZ ...,...,Food,Product contains an ingredient that was recall...,20170109,F-1504-2017,"Palmer and Company, dba Palmer Candy Co",20170301,IA,Terminated,,Voluntary: Firm Initiated


In [48]:
#As we have 178 records over 16691 (1%) we could remove this records from the final analysis file. 
# We create a new dataframe without these records. 
df = df.loc[~((df['termination_date'].isnull()) & (df.status =='Terminated')| (df.status == 'Completed')),:]
display(df.head())
print(len(df))

Unnamed: 0,center_classification_date,city,classification,code_info,country,distribution_pattern,event_id,initial_firm_notification,postal_code,product_description,...,product_type,reason_for_recall,recall_initiation_date,recall_number,recalling_firm,report_date,state,status,termination_date,voluntary_mandated
0,20120926,Los Angeles,Class II,not available.,United States,California,63150,E-Mail,90021-1616,"Seasoned Clams, 240 grams",...,Food,M & K Trading is recalling Korean Molluscan Sh...,20120921,F-2396-2012,M & K Trading Inc,20121003,CA,Terminated,20120926,Voluntary: Firm Initiated
1,20120924,Irwindale,Class I,"UPC 7774523746, Use by dates 9/8/2012 or earli...",United States,Nationwide to following US States and Canada: ...,63062,Press Release,91706-2161,"Ready Pac¿ Sliced Mango, 10.5oz , UPC 77745237...",...,Food,"Firm is voluntarily recalling, out of an abund...",20120831,F-2382-2012,Ready Pac Foods Inc,20121003,CA,Terminated,20121025,Voluntary: Firm Initiated
2,20120927,Fullerton,Class II,"UPC 0-30871-33001-2, Item # 0291710.",United States,Nationwide and Canada.,62991,Letter,92833-2510,"Wildwood Organic Mild Salsa , 14 oz, Pack size...",...,Food,The firm recalled due to a potential non-safet...,20120712,F-2438-2012,Pulmuone Wildwood Inc,20121003,CA,Terminated,20121217,Voluntary: Firm Initiated
3,20120927,Fullerton,Class II,"UPC 0-52334-11659-9, Item # 0291661.",United States,Nationwide and Canada.,62991,Letter,92833-2510,Wildwood Emerald Valley Kitchen Organic Medium...,...,Food,The firm recalled due to a potential non-safet...,20120712,F-2436-2012,Pulmuone Wildwood Inc,20121003,CA,Terminated,20121217,Voluntary: Firm Initiated
4,20120921,Manheim,Class II,"No codes; all product ""Purchase by date shown ...",United States,Product was distributed to specific wholesale ...,62465,Telephone,17545-9639,"Bulk Foods, Inc. Yellow Corn Meal, Regular Roa...",...,Food,FDA samples of product tested positive for Fum...,20120417,F-2374-2012,Haldeman Mills,20121003,PA,Terminated,20130716,Voluntary: Firm Initiated


16513


In [49]:
# Select classification where the center_classification_date is missing 
df['classification'][df['center_classification_date'].isnull()].value_counts()

Not Yet Classified    6
Name: classification, dtype: int64

***Our hypothesis is met so we can conclude that center_classification_date missing values are consistent with their classification. One row has been removed because matched the above condition on termination_date***

## 2. Data types Correction

In [50]:
# We set identified numeric variables as date type ones
df['event_id'] = df['event_id'].astype(int)

*** We have two more potential numeric columns postal_code and recall number.***
<br>
*** The postal code is only available for United States recalls. The first part of the ZIP code (90021) corresponds to the city(Los Angeles), information that we have already available in city column. The second part of the ZIP Code (1616) corresponds, more or less, to the county.***
<br>
***We can split this column in two: city_code and county_code for***
<br>
    *** The recall number has the following structure, F-2396-2012 where the first part means product_type, the second the number assigned to the recall, and the third part is the year of the recall_initiation_date***  

In [51]:
# We leave the postal_code split new columns as strings 
df['city_code'] = df.postal_code.str.split('-').str.get(0)
df['county_code'] = df.postal_code.str.split('-').str.get(1)

In [52]:
# We select all df columns except the postal_code we have just 'split'
df = df.loc[:, df.columns != 'postal_code']

In [53]:
print(len(df))
print(list(df))

16513
['center_classification_date', 'city', 'classification', 'code_info', 'country', 'distribution_pattern', 'event_id', 'initial_firm_notification', 'product_description', 'product_quantity', 'product_type', 'reason_for_recall', 'recall_initiation_date', 'recall_number', 'recalling_firm', 'report_date', 'state', 'status', 'termination_date', 'voluntary_mandated', 'city_code', 'county_code']


***We extract the number in the second position to a new column. We leave as string as there are records with registration errors.***

In [54]:
df['modified_recall_number'] = df.recall_number.str.split('-').str[1]
df['modified_recall_number_bis'] = df.recall_number.str.split('-').str[0]
df['modified_recall_number'].fillna(df['modified_recall_number_bis'], inplace=True)
print(df['modified_recall_number'].head(3))
print(Counter(df['modified_recall_number'].isnull()))
display(df.head())

0    2396
1    2382
2    2438
Name: modified_recall_number, dtype: object
Counter({False: 16513})


Unnamed: 0,center_classification_date,city,classification,code_info,country,distribution_pattern,event_id,initial_firm_notification,product_description,product_quantity,...,recalling_firm,report_date,state,status,termination_date,voluntary_mandated,city_code,county_code,modified_recall_number,modified_recall_number_bis
0,20120926,Los Angeles,Class II,not available.,United States,California,63150,E-Mail,"Seasoned Clams, 240 grams",39 units,...,M & K Trading Inc,20121003,CA,Terminated,20120926,Voluntary: Firm Initiated,90021,1616,2396,F
1,20120924,Irwindale,Class I,"UPC 7774523746, Use by dates 9/8/2012 or earli...",United States,Nationwide to following US States and Canada: ...,63062,Press Release,"Ready Pac¿ Sliced Mango, 10.5oz , UPC 77745237...",1544 cases,...,Ready Pac Foods Inc,20121003,CA,Terminated,20121025,Voluntary: Firm Initiated,91706,2161,2382,F
2,20120927,Fullerton,Class II,"UPC 0-30871-33001-2, Item # 0291710.",United States,Nationwide and Canada.,62991,Letter,"Wildwood Organic Mild Salsa , 14 oz, Pack size...",602 units,...,Pulmuone Wildwood Inc,20121003,CA,Terminated,20121217,Voluntary: Firm Initiated,92833,2510,2438,F
3,20120927,Fullerton,Class II,"UPC 0-52334-11659-9, Item # 0291661.",United States,Nationwide and Canada.,62991,Letter,Wildwood Emerald Valley Kitchen Organic Medium...,575 units,...,Pulmuone Wildwood Inc,20121003,CA,Terminated,20121217,Voluntary: Firm Initiated,92833,2510,2436,F
4,20120921,Manheim,Class II,"No codes; all product ""Purchase by date shown ...",United States,Product was distributed to specific wholesale ...,62465,Telephone,"Bulk Foods, Inc. Yellow Corn Meal, Regular Roa...",,...,Haldeman Mills,20121003,PA,Terminated,20130716,Voluntary: Firm Initiated,17545,9639,2374,F


In [55]:
# We select all df columns except the recall_number we have just 'split'
df = df.loc[:, (df.columns != 'recall_number') & (df.columns != 'modified_recall_number_bis')]
print(df.dtypes)
display(df.tail())

center_classification_date    object
city                          object
classification                object
code_info                     object
country                       object
distribution_pattern          object
event_id                       int64
initial_firm_notification     object
product_description           object
product_quantity              object
product_type                  object
reason_for_recall             object
recall_initiation_date        object
recalling_firm                object
report_date                   object
state                         object
status                        object
termination_date              object
voluntary_mandated            object
city_code                     object
county_code                   object
modified_recall_number        object
dtype: object


Unnamed: 0,center_classification_date,city,classification,code_info,country,distribution_pattern,event_id,initial_firm_notification,product_description,product_quantity,...,recall_initiation_date,recalling_firm,report_date,state,status,termination_date,voluntary_mandated,city_code,county_code,modified_recall_number
16686,20180828,Albany,Class II,"Use By Dates: 8/22/2018, 8/23/2018, 8/24/2018,...",United States,"Firm distributes to wholesale accounts, retail...",80889,Telephone,"Mexican Style Ricotta Cheese labeled, ""Don Fro...",1357 total units (1330 units of 1 lb.; 3 bags ...,...,20180820,Ochoa's Queseria LLC,20180905,OR,Terminated,20180918.0,Voluntary: Firm Initiated,97321,2705,1908
16687,20180824,Austin,Class II,"May 29, 2018 [E2918 batch code on can]; UPC: 7...",United States,Domestic: AR Foreign/VA/DOD: None,80806,"Two or more of the following: Email, Fax, Lett...","Tomato Condensed Soup, 26 oz (1 LB 10 OZ) 737 ...","6,168 cans",...,20180808,"Morgan Foods, Inc.",20180905,IN,Ongoing,,Voluntary: Firm Initiated,47102,1741,1855
16688,20180828,Aurora,Class II,,United States,distributed in Oregon,80816,Telephone,Mustard was shipped in card board boxes that a...,2 dozens,...,20180813,Howard Calcagno Farms,20180905,OR,Ongoing,,Voluntary: Firm Initiated,97002,8316,1912
16689,20180828,Aurora,Class II,,United States,distributed in Oregon,80816,Telephone,Dill was shipped in card board boxes that are ...,6 ct,...,20180813,Howard Calcagno Farms,20180905,OR,Ongoing,,Voluntary: Firm Initiated,97002,8316,1927
16690,20180904,Portland,Class II,Use by date Sept 9/18,United States,distributed in Oregon and Washington,80897,Telephone,"Sammy Salsa Smok'n Chipotle, packaged in 14 oz...",17 cases,...,20180815,Sammy Food Products LLC,20180912,OR,Terminated,20180919.0,Voluntary: Firm Initiated,97229,5348,1951


In [56]:
#We set variables as categorical type
df['classification'] = df.classification.astype('category')
df['status']= df.status.astype('category')
df['product_type'] = df.product_type.astype('category')
df['voluntary_mandated']=df.voluntary_mandated.astype('category')

*** The column initial_firm_notification could take the following values, 2 of them could cause errors in further steps. <br> The value Two or more of the following..... will be replaced by two_or_more and the one with a blank class will be replaced by notavailable.***

In [57]:
# Before setting initial firm notification as a categorical variable we are going to replace 
print(Counter(df['initial_firm_notification']))

Counter({'Two or more of the following: Email, Fax, Letter, Press Release, Telephone, Visit': 6023, 'Letter': 3704, 'Press Release': 2342, 'E-Mail': 2118, 'Telephone': 1880, 'Visit': 239, 'Other': 140, 'FAX': 63, '': 4})


In [58]:
mask = df.initial_firm_notification == ''
column_name = 'initial_firm_notification'
df.loc[mask, column_name] = 'NA'

In [59]:
print(Counter(df['initial_firm_notification']))

Counter({'Two or more of the following: Email, Fax, Letter, Press Release, Telephone, Visit': 6023, 'Letter': 3704, 'Press Release': 2342, 'E-Mail': 2118, 'Telephone': 1880, 'Visit': 239, 'Other': 140, 'FAX': 63, 'NA': 4})


In [60]:
mask = df.initial_firm_notification.str.contains('following',na=False)
column_name = 'initial_firm_notification'
df.loc[mask,column_name] = 'Two_or_More'

In [61]:
print(Counter(df['initial_firm_notification']))

Counter({'Two_or_More': 6023, 'Letter': 3704, 'Press Release': 2342, 'E-Mail': 2118, 'Telephone': 1880, 'Visit': 239, 'Other': 140, 'FAX': 63, 'NA': 4})


In [62]:
# We set the variable as categorical 
df['initial_firm_notification'] = df.initial_firm_notification.astype('category')

In [63]:
print (Counter(df['classification']))
print(Counter(df['status']))
print(Counter(df['product_type']))
print(Counter(df['voluntary_mandated']))

Counter({'Class II': 8284, 'Class I': 7247, 'Class III': 976, 'Not Yet Classified': 6})
Counter({'Terminated': 14918, 'Ongoing': 1595})
Counter({'Food': 16513})
Counter({'Voluntary: Firm Initiated': 16202, 'FDA Mandated': 309, 'Voluntary: FDA Requested': 2})


In [64]:
# We set identified datetime variables as date type ones
df['center_classification_date'] = pd.to_datetime(df.center_classification_date, format="%Y/%m/%d")
df['report_date'] = pd.to_datetime(df.report_date, format="%Y/%m/%d")
df['termination_date'] = pd.to_datetime(df.termination_date, format="%Y/%m/%d")
df['recall_initiation_date'] = pd.to_datetime(df.recall_initiation_date, format="%Y/%m/%d")

ValueError: time data 02121207 doesn't match format specified

## 3. Incorrect Values 

In [65]:
# We get an incorret value in recall_initiation_date and the data type cannot be changed. We will replace for a NaN
#df['recall_initiation_date'] = np.where(df['recall_initiation_date'] == '02121207', '','')
df['recall_initiation_date'] = df['recall_initiation_date'].replace('02121207', '')
Counter(df['recall_initiation_date']. isnull())

Counter({False: 16513})

In [66]:
df.loc[df['recall_initiation_date'] =='',:]

Unnamed: 0,center_classification_date,city,classification,code_info,country,distribution_pattern,event_id,initial_firm_notification,product_description,product_quantity,...,recall_initiation_date,recalling_firm,report_date,state,status,termination_date,voluntary_mandated,city_code,county_code,modified_recall_number
7956,2013-01-14,Englewood Cliffs,Class II,Sept1113BUO Sept1213BUO,United States,Nationwide,63851,Letter,Knorr Pasta Sides Cheesy Bacon Macaroni Net Wt...,"16,500 puches",...,,"Unilever United States, Inc.",2013-01-23,NJ,Terminated,2015-01-16,Voluntary: Firm Initiated,7632,3113,880


In [67]:
#Now we can convert 'recall_initiation_date' to datetime data type
df['recall_initiation_date'] = pd.to_datetime(df.recall_initiation_date, format="%Y/%m/%d")

In [68]:
print(df.dtypes)

center_classification_date    datetime64[ns]
city                                  object
classification                      category
code_info                             object
country                               object
distribution_pattern                  object
event_id                               int64
initial_firm_notification           category
product_description                   object
product_quantity                      object
product_type                        category
reason_for_recall                     object
recall_initiation_date        datetime64[ns]
recalling_firm                        object
report_date                   datetime64[ns]
state                                 object
status                              category
termination_date              datetime64[ns]
voluntary_mandated                  category
city_code                             object
county_code                           object
modified_recall_number                object
dtype: obj

## 4. Some additional stuff 

***The product_type column only takes value "Food", the product description is a no-structured field and it's difficult to extract if the product is fruit, salad, meat,etc... in order to extract some patterns from the product quantity column*** 

In [69]:
print(Counter(df['product_type']))
display(df['product_quantity'][:20])

Counter({'Food': 16513})


0                               39 units
1                             1544 cases
2                              602 units
3                              575 units
4                                       
5                              200 units
6     7 cases by GFS distribution center
7                                Unknown
8                        118/20-kg. bags
9                                Unknown
10                              96 cakes
11                               11 pies
12                          19,060 cases
13                                    22
14                               40 pies
15                             138 cakes
16                             748 units
17                            1034 units
18                                      
19                      will be provided
Name: product_quantity, dtype: object

In [70]:
display(df['product_description'][:20])

0                             Seasoned Clams, 240 grams
1     Ready Pac¿ Sliced Mango, 10.5oz , UPC 77745237...
2     Wildwood Organic Mild Salsa , 14 oz, Pack size...
3     Wildwood Emerald Valley Kitchen Organic Medium...
4     Bulk Foods, Inc. Yellow Corn Meal, Regular Roa...
5     Whey Protein Isolate Cold-Filtration, Net Weig...
6     Pistachios, Shelled Raw, 4 /2.5 Packed by Trop...
7     Sadie's Salads Macaroni and Cheese, 5 lb and 1...
8     Kerry Organic Pure 900, Organic Soy Protein Is...
9     Sadie's Salads Cole Slaw, 5 lb, 10 lb and 30 l...
10         McClure's Pies & Salads, Inc. Chocolate Cake
11              McClure's Pies & Salads, Inc. Pecan Pie
12    Starbucks¿ Seasonal Harvest Fruit Blend, 6oz, ...
13            McClure's Pies & Salads, Inc. Orange Cake
14              McClure's Pies & Salads, Inc. Apple Pie
15    McClure's Pies & Salads, Inc. Apple Cinnamon S...
16    Wildwood Organic Medium Salsa , 14 oz, Pack si...
17    Wawa pineapple Net Wt. 7 oz (198 g) rigid 

In [71]:
#We drop the three columns, product type, product description and product quantity

In [72]:
df = df.drop(['product_type', 'product_quantity','product_description'],axis = 1)
print(list(df))

['center_classification_date', 'city', 'classification', 'code_info', 'country', 'distribution_pattern', 'event_id', 'initial_firm_notification', 'reason_for_recall', 'recall_initiation_date', 'recalling_firm', 'report_date', 'state', 'status', 'termination_date', 'voluntary_mandated', 'city_code', 'county_code', 'modified_recall_number']


In [73]:
display(df.head())

Unnamed: 0,center_classification_date,city,classification,code_info,country,distribution_pattern,event_id,initial_firm_notification,reason_for_recall,recall_initiation_date,recalling_firm,report_date,state,status,termination_date,voluntary_mandated,city_code,county_code,modified_recall_number
0,2012-09-26,Los Angeles,Class II,not available.,United States,California,63150,E-Mail,M & K Trading is recalling Korean Molluscan Sh...,2012-09-21,M & K Trading Inc,2012-10-03,CA,Terminated,2012-09-26,Voluntary: Firm Initiated,90021,1616,2396
1,2012-09-24,Irwindale,Class I,"UPC 7774523746, Use by dates 9/8/2012 or earli...",United States,Nationwide to following US States and Canada: ...,63062,Press Release,"Firm is voluntarily recalling, out of an abund...",2012-08-31,Ready Pac Foods Inc,2012-10-03,CA,Terminated,2012-10-25,Voluntary: Firm Initiated,91706,2161,2382
2,2012-09-27,Fullerton,Class II,"UPC 0-30871-33001-2, Item # 0291710.",United States,Nationwide and Canada.,62991,Letter,The firm recalled due to a potential non-safet...,2012-07-12,Pulmuone Wildwood Inc,2012-10-03,CA,Terminated,2012-12-17,Voluntary: Firm Initiated,92833,2510,2438
3,2012-09-27,Fullerton,Class II,"UPC 0-52334-11659-9, Item # 0291661.",United States,Nationwide and Canada.,62991,Letter,The firm recalled due to a potential non-safet...,2012-07-12,Pulmuone Wildwood Inc,2012-10-03,CA,Terminated,2012-12-17,Voluntary: Firm Initiated,92833,2510,2436
4,2012-09-21,Manheim,Class II,"No codes; all product ""Purchase by date shown ...",United States,Product was distributed to specific wholesale ...,62465,Telephone,FDA samples of product tested positive for Fum...,2012-04-17,Haldeman Mills,2012-10-03,PA,Terminated,2013-07-16,Voluntary: Firm Initiated,17545,9639,2374


*** The column Voluntary/ Mandated contains 2 main classes, voluntary and mandated. The voluntary class has 2 sub-classes: Firm Initiated and FDA Requested<br>We only have 2 voluntary:FDA requested records, this subclassification provides little additional information for further analysis.<br>We will clean this column to leave only 2 classes, Voluntary and Mandated***

In [74]:
print(Counter(df['voluntary_mandated']))

Counter({'Voluntary: Firm Initiated': 16202, 'FDA Mandated': 309, 'Voluntary: FDA Requested': 2})


In [75]:
df['voluntary_mandated'] = df.voluntary_mandated.str.split(':').str.get(0)
print(Counter(df['voluntary_mandated']))
display(df.head())

Counter({'Voluntary': 16204, 'FDA Mandated': 309})


Unnamed: 0,center_classification_date,city,classification,code_info,country,distribution_pattern,event_id,initial_firm_notification,reason_for_recall,recall_initiation_date,recalling_firm,report_date,state,status,termination_date,voluntary_mandated,city_code,county_code,modified_recall_number
0,2012-09-26,Los Angeles,Class II,not available.,United States,California,63150,E-Mail,M & K Trading is recalling Korean Molluscan Sh...,2012-09-21,M & K Trading Inc,2012-10-03,CA,Terminated,2012-09-26,Voluntary,90021,1616,2396
1,2012-09-24,Irwindale,Class I,"UPC 7774523746, Use by dates 9/8/2012 or earli...",United States,Nationwide to following US States and Canada: ...,63062,Press Release,"Firm is voluntarily recalling, out of an abund...",2012-08-31,Ready Pac Foods Inc,2012-10-03,CA,Terminated,2012-10-25,Voluntary,91706,2161,2382
2,2012-09-27,Fullerton,Class II,"UPC 0-30871-33001-2, Item # 0291710.",United States,Nationwide and Canada.,62991,Letter,The firm recalled due to a potential non-safet...,2012-07-12,Pulmuone Wildwood Inc,2012-10-03,CA,Terminated,2012-12-17,Voluntary,92833,2510,2438
3,2012-09-27,Fullerton,Class II,"UPC 0-52334-11659-9, Item # 0291661.",United States,Nationwide and Canada.,62991,Letter,The firm recalled due to a potential non-safet...,2012-07-12,Pulmuone Wildwood Inc,2012-10-03,CA,Terminated,2012-12-17,Voluntary,92833,2510,2436
4,2012-09-21,Manheim,Class II,"No codes; all product ""Purchase by date shown ...",United States,Product was distributed to specific wholesale ...,62465,Telephone,FDA samples of product tested positive for Fum...,2012-04-17,Haldeman Mills,2012-10-03,PA,Terminated,2013-07-16,Voluntary,17545,9639,2374


In [76]:
mask = df.voluntary_mandated == 'FDA Mandated'
column_name = 'voluntary_mandated'
df.loc[mask, column_name] = 'Mandated'

In [77]:
print(Counter(df['voluntary_mandated']))

Counter({'Voluntary': 16204, 'Mandated': 309})


*** In order to use some of the variables as product description, recall_reason or code_info it would necessary to apply NLP modellization to extract relevant information for further analysis.***

*** As we do not have numeric values in this dataset, It doesn't make too much sense to evaluate low variance columns and outliers<br> We save the final dataframe to csv file after checking *** 

In [78]:
print(list(df))
print(df.info())
display(df.head())

['center_classification_date', 'city', 'classification', 'code_info', 'country', 'distribution_pattern', 'event_id', 'initial_firm_notification', 'reason_for_recall', 'recall_initiation_date', 'recalling_firm', 'report_date', 'state', 'status', 'termination_date', 'voluntary_mandated', 'city_code', 'county_code', 'modified_recall_number']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16513 entries, 0 to 16690
Data columns (total 19 columns):
center_classification_date    16507 non-null datetime64[ns]
city                          16513 non-null object
classification                16513 non-null category
code_info                     16513 non-null object
country                       16513 non-null object
distribution_pattern          16513 non-null object
event_id                      16513 non-null int64
initial_firm_notification     16513 non-null category
reason_for_recall             16513 non-null object
recall_initiation_date        16512 non-null datetime64[ns]
recalling_f

Unnamed: 0,center_classification_date,city,classification,code_info,country,distribution_pattern,event_id,initial_firm_notification,reason_for_recall,recall_initiation_date,recalling_firm,report_date,state,status,termination_date,voluntary_mandated,city_code,county_code,modified_recall_number
0,2012-09-26,Los Angeles,Class II,not available.,United States,California,63150,E-Mail,M & K Trading is recalling Korean Molluscan Sh...,2012-09-21,M & K Trading Inc,2012-10-03,CA,Terminated,2012-09-26,Voluntary,90021,1616,2396
1,2012-09-24,Irwindale,Class I,"UPC 7774523746, Use by dates 9/8/2012 or earli...",United States,Nationwide to following US States and Canada: ...,63062,Press Release,"Firm is voluntarily recalling, out of an abund...",2012-08-31,Ready Pac Foods Inc,2012-10-03,CA,Terminated,2012-10-25,Voluntary,91706,2161,2382
2,2012-09-27,Fullerton,Class II,"UPC 0-30871-33001-2, Item # 0291710.",United States,Nationwide and Canada.,62991,Letter,The firm recalled due to a potential non-safet...,2012-07-12,Pulmuone Wildwood Inc,2012-10-03,CA,Terminated,2012-12-17,Voluntary,92833,2510,2438
3,2012-09-27,Fullerton,Class II,"UPC 0-52334-11659-9, Item # 0291661.",United States,Nationwide and Canada.,62991,Letter,The firm recalled due to a potential non-safet...,2012-07-12,Pulmuone Wildwood Inc,2012-10-03,CA,Terminated,2012-12-17,Voluntary,92833,2510,2436
4,2012-09-21,Manheim,Class II,"No codes; all product ""Purchase by date shown ...",United States,Product was distributed to specific wholesale ...,62465,Telephone,FDA samples of product tested positive for Fum...,2012-04-17,Haldeman Mills,2012-10-03,PA,Terminated,2013-07-16,Voluntary,17545,9639,2374


In [79]:
df.to_csv('fda_enforcement_reports.csv')