# Notebook 7: Manual Variable Selection
During the course of this project, the FAO revised its website to include variable descriptions for each of the food items in the crop and livestock data. Inspecting these descriptions revealed that many of the food items overlap and several are incorrectly defined. Including these items would likely confound the results of our statistical models. Here, we manually check and eliminate the variables that are redundant, incorrectly defined, or ambiguous in content.

In [12]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
import pickle
from collections import Counter
from sklearn.preprocessing import Imputer
%matplotlib inline

### Removing Redundant Variables
We found metadata on the dataset through the FAO website. We will examine which subcategories each variable contains and drop some of the variables whose subcategories overlap with those of other variables. This should help reduce multicollinearity in the final model.
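
To make the multicollinearity concern concrete, here is a minimal sketch on synthetic data (not FAO data). Two hypothetical aggregates that both count the same subcategory end up strongly correlated, while two series with nothing in common do not. The variable names, the seed, and the 0.3 noise scale are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed so the sketch is reproducible
n = 500

# hypothetical subcategory series (synthetic, not FAO data)
shared = rng.normal(size=n)   # a subcategory counted in both aggregates
only_a = rng.normal(size=n)   # a subcategory unique to aggregate A
only_b = rng.normal(size=n)   # a subcategory unique to aggregate B

# aggregates that double count the shared subcategory
agg_a = shared + 0.3 * only_a
agg_b = shared + 0.3 * only_b

corr_overlap = np.corrcoef(agg_a, agg_b)[0, 1]
corr_disjoint = np.corrcoef(only_a, only_b)[0, 1]

print("correlation with a shared subcategory: {:.2f}".format(corr_overlap))
print("correlation without a shared subcategory: {:.2f}".format(corr_disjoint))
```

Real food items will overlap less cleanly than this, so the sketch only shows the direction of the effect, not its size in our data.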

In [13]:
# import variable descriptions
livestock_desc = pd.read_csv("data/metadata/var_desc_livestock.csv")
crops_desc = pd.read_csv("data/metadata/var_desc_crops.csv")

In [14]:
# Load the variable descriptions from pickled files
with open('data/clean/var_desc_livestock.p', 'rb') as f:
    livestock_desc = pickle.load(f)
with open('data/clean/var_desc_crops.p', 'rb') as f:
    crops_desc = pickle.load(f)

# load cleaned food data (pickles must be opened in binary mode)
with open('data/imputed/food_1970_2000_cleaned.p', 'rb') as f:
    food_1970_2000_cleaned = pickle.load(f)

In [15]:
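# cast descriptions to str so that missing values become the literal 'nan',
# which the checks further down compare against
livestock_desc['Description'] = livestock_desc['Description'].astype(str)
crops_desc['Description'] = crops_desc['Description'].astype(str)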

In [16]:
livestock_desc.head()

Unnamed: 0,Item Code,Item,Description,HS Code,HS07 Code,HS12 Code,CPC Code
0,2946,Animal fats,,,,,
1,2941,Animal Products,,,,,
2,2769,"Aquatic Animals, Others","Default composition: 1587 Aqutc Anim F, 1588 A...",,,,
3,2775,Aquatic Plants,"Default composition: 1594 Aquatic plants, fres...",,,,
4,2961,"Aquatic Products, Other",,,,,


In [17]:
crops_desc.head()

Unnamed: 0,Item Code,Item,Description,HS Code,HS07 Code,HS12 Code,CPC Code
0,2924,Alcoholic Beverages,,,,,
1,2617,Apples and products,"Default composition: 515 Apples, 518 Juice, ap...",,,,
2,2615,Bananas,Default composition: 486 Bananas,,,,
3,2513,Barley and products,"Default composition: 44 Barley, 45 Barley, pot...",,,,
4,2546,Beans,"Default composition: 176 Beans, dry",,,,


We would like to assess whether multiple items contain a given code in their descriptions, which would result in double counting that code.
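
As a toy illustration of this check, the sketch below counts how many items mention each code and reports the codes that appear in more than one item. The item names and codes are made up, `extract_codes` is a hypothetical helper, and it uses a regular expression instead of the `str.split` and `isdigit` approach used in the notebook cell, so treat it as a variation and not the notebook's exact method.

```python
import re
from collections import Counter

# toy descriptions in the same "Default composition: <code> <name>, ..." shape
# (item names and codes are invented for illustration)
toy_descriptions = {
    "Item A": "Default composition: 10 Foo, 11 Bar",
    "Item B": "Default composition: 11 Bar, 12 Baz",
    "Item C": "nan",  # a missing description yields no codes
}

def extract_codes(description):
    """Return the integer codes that appear as standalone numbers."""
    return [int(tok) for tok in re.findall(r"\b\d+\b", description)]

code_counter = Counter()
item_codes = {}
for item, desc in toy_descriptions.items():
    codes = extract_codes(desc)
    if codes:
        item_codes[item] = codes
        code_counter.update(set(codes))  # count each code once per item

# codes that occur in more than one item would be double counted
redundant = sorted(code for code, count in code_counter.items() if count > 1)
print(redundant)
```

Counting each code once per item means a code repeated within a single description is not flagged, which matches the goal of finding overlap between items.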

In [18]:
# create counters for each code in the description
livestock_counter = Counter()
crops_counter = Counter()

# map each item to the codes it contains
livestock_dict = {}
crops_dict = {}

# go through all descriptions in livestock
for index, val in enumerate(livestock_desc['Description']):
    ingredients = [int(num) for num in val.split() if num.isdigit()]
    # check that the list is not empty
    if ingredients:
        item = livestock_desc.iloc[index, :]['Item']
        for ingredient in ingredients:
            livestock_counter[ingredient] += 1
        livestock_dict[item] = ingredients
    
# go through all descriptions in crops
for index, val in enumerate(crops_desc['Description']):
    ingredients = [int(num) for num in val.split() if num.isdigit()]
    # check that the list is not empty
    if ingredients:
        item = crops_desc.iloc[index, :]['Item']
        for ingredient in ingredients:
            crops_counter[ingredient] += 1
        crops_dict[item] = ingredients

# store redundant livestock codes
redundant_livestock_codes = []
# store redundant crop codes
redundant_crop_codes = []

# get redundant item codes for livestock and crops
print("Redundant Livestock codes:")
for key, value in livestock_counter.items():
    if value > 1:
        print(key)
        redundant_livestock_codes.append(key)

print("Redundant Crop codes:")
for key, value in crops_counter.items():
    if value > 1:
        print(key)
        redundant_crop_codes.append(key)

Redundant Livestock codes:
Redundant Crop codes:
27
33
35
154
155
160
161
162
163
165
166
167
172
173
242
567
568


In [19]:
for key, value in crops_dict.items():
    for code in value:
        if code in redundant_crop_codes:
            print(key + ": " + str(code))

Vegetables, Other: 567
Vegetables, Other: 568
Fruits, Other: 567
Fruits, Other: 568
Sugar, Refined Equiv: 162
Sugar (Raw Equivalent): 162
Groundnuts (in Shell Eq): 242
Sugar, Raw Equivalent: 154
Sugar, Raw Equivalent: 155
Sugar, Raw Equivalent: 160
Sugar, Raw Equivalent: 161
Sugar, Raw Equivalent: 162
Sugar, Raw Equivalent: 163
Sugar, Raw Equivalent: 166
Sugar, Raw Equivalent: 167
Sugar, Raw Equivalent: 172
Sugar, Raw Equivalent: 173
Molasses: 165
Sugar non-centrifugal: 163
Sweeteners, Other: 154
Sweeteners, Other: 155
Sweeteners, Other: 160
Sweeteners, Other: 161
Sweeteners, Other: 165
Sweeteners, Other: 166
Sweeteners, Other: 167
Sweeteners, Other: 172
Sweeteners, Other: 173
Rice (Paddy Equivalent): 27
Rice (Paddy Equivalent): 33
Rice (Paddy Equivalent): 35
Groundnuts (Shelled Eq): 242
Rice (Milled Equivalent): 27
Rice (Milled Equivalent): 33
Rice (Milled Equivalent): 35


There were no redundancies for livestock.
For crops, redundancies were:

* 567: Watermelon
    * Vegetables, Other
    * Fruits, Other
* 568: Melon
    * Vegetables, Other
    * Fruits, Other
* 242: Groundnuts
    * Groundnuts (in Shell Eq)	
    * Groundnuts (Shelled Eq)	
    * Oilcrops Oil, Other
* 154: Fructose chemically pure
    * Sugar, Raw Equivalent	
    * Sweeteners, Other	
* 155: Maltose chemically pure
    * Sugar, Raw Equivalent
    * Sweeteners, Other
* 160: Maple sugar and syrups
    * Sugar, Raw Equivalent	
    * Sweeteners, Other	
* 161: Sugar crops, nes
    * Sugar, Raw Equivalent
    * Sweeteners, Other
* 162: Sugar Raw Centrifugal
    * Sugar (Raw Equivalent)
    * Sugar, Raw Equivalent	
    * Sugar, Refined Equiv	
* 163: Sugar non-centrifugal
    * Sugar non-centrifugal
    * Sugar, Raw Equivalent	
* 165: Molasses
    * Molasses
    * Sweeteners, Other
* 166: Fructose and syrup, other
    * Sugar, Raw Equivalent
    * Sweeteners, Other
* 167: Sugar, nes
    * Sugar, Raw Equivalent	
    * Sweeteners, Other	
* 172: Glucose and dextrose
    * Sugar, Raw Equivalent	
    * Sweeteners, Other	
* 173: Lactose
    * Sugar, Raw Equivalent
    * Sweeteners, Other	
* 27: Rice
    * Rice (Milled Equivalent)
    * Rice (Paddy Equivalent)
* 33: Gluten
    * Rice (Milled Equivalent)	
    * Rice (Paddy Equivalent)	
* 35: Bran, rice
    * Rice (Milled Equivalent)	
    * Rice (Paddy Equivalent)



We will go through these redundancies using the FAO data inspection tool and build up a list of variables to drop.
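
The membership checks that follow are run one item at a time. As a compact alternative, a sketch like the one below checks several candidates in one pass. The set `toy_columns` is a hypothetical stand-in for `food_1970_2000_cleaned.columns`, and the candidate list is only an example.

```python
# hypothetical column names standing in for food_1970_2000_cleaned.columns
toy_columns = {"Vegetables, Other", "Fruits, Other", "Rice (Milled Equivalent)"}

# items we want to look up before deciding what to drop
candidates = ["Vegetables, Other", "Sugar non-centrifugal", "Fruits, Other"]

# map each candidate to whether it is still present in the dataset
presence = {name: name in toy_columns for name in candidates}
for name in candidates:
    print("{}: {}".format(name, presence[name]))
```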

In [20]:
# get the cleaned columns so we can check whether the columns containing redundant
# subcategories are currently in our dataset
column_set = set(food_1970_2000_cleaned.columns)

In [21]:
'Vegetables, Other' in column_set

True

In [22]:
'Fruits, Other' in column_set

True

We will keep both `Vegetables, Other` and `Fruits, Other` because each has numerous other constituent crops besides watermelon and melon.

In [23]:
'Groundnuts (in Shell Eq)' in column_set

True

In [24]:
'Groundnuts (Shelled Eq)' in column_set

True

In [25]:
'Oilcrops Oil, Other' in column_set

True

We will drop `Groundnuts (in Shell Eq)` since `Groundnuts (Shelled Eq)` contains all the crops in `Groundnuts (in Shell Eq)`. We will keep `Oilcrops Oil, Other` because it has only one subcategory in common with `Groundnuts (Shelled Eq)` and contains many unique subcategories.

In [26]:
'Sugar, Raw Equivalent' in column_set

True

In [27]:
'Sweeteners, Other' in column_set

True

In [28]:
'Sugar non-centrifugal' in column_set

False

In [29]:
'Sugar (Raw Equivalent)' in column_set

True

We will drop `Sweeteners, Other` (the "others" in that category were already dropped because they had many NaNs) and keep `Sugar, Raw Equivalent` because it seems to contain more relevant subcategories. `Sugar non-centrifugal` appears to have already been dropped. We will keep `Sugar (Raw Equivalent)` because it has only one subcategory in common with `Sugar, Raw Equivalent` and has many additional subcategories that may be relevant. We will drop `Sugar, Refined Equiv` because it contains only one subcategory, which is already contained in `Sugar, Raw Equivalent`.

In [30]:
'Rice (Paddy Equivalent)' in column_set

True

In [31]:
'Rice (Milled Equivalent)' in column_set

True

`Rice (Milled Equivalent)` contains all the subcategories in `Rice (Paddy Equivalent)` plus some additional ones, so we will drop `Rice (Paddy Equivalent)`.
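
The drop decisions above rest on whether one item's subcategory codes are fully contained in another's. A minimal sketch of that comparison with Python sets is shown below. `toy_dict` is a hypothetical mapping in the shape of `crops_dict`, the codes are invented, and `overlap_report` is an illustrative helper, not part of the original notebook.

```python
# hypothetical item -> code mapping in the shape of crops_dict
# (codes are invented for illustration and are not FAO codes)
toy_dict = {
    "Item A (narrow)": [1, 2],
    "Item A (broad)": [1, 2, 3],
    "Item B": [3, 4, 5, 6],
}

def overlap_report(codes_by_item, first, second):
    """Summarise how the codes of two items relate to each other."""
    a, b = set(codes_by_item[first]), set(codes_by_item[second])
    return {
        "shared": sorted(a & b),
        "first_is_subset": a <= b,
        "second_is_subset": b <= a,
    }

# the narrow item is fully contained in the broad one: a candidate to drop
print(overlap_report(toy_dict, "Item A (narrow)", "Item A (broad)"))
# only one shared code and many unique ones: a candidate to keep
print(overlap_report(toy_dict, "Item A (broad)", "Item B"))
```

A check like this is only as reliable as the codes parsed from the descriptions, so it complements the manual inspection and does not replace it.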

In [32]:
# columns to drop because of overlapping subcategories
more_columns_to_drop = ['Groundnuts (in Shell Eq)',
                        'Sweeteners, Other',
                        'Sugar, Refined Equiv',
                        'Rice (Paddy Equivalent)',
                       ]

In [33]:
# drop the columns
food_1970_2000_cleaned = food_1970_2000_cleaned.drop(more_columns_to_drop, axis = 1)

### Columns with No Descriptions
We will also drop columns that have no descriptions, because including them would make our results less interpretable.

First, we evaluate which columns have no descriptions.
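
The cells that follow find missing descriptions by comparing against the string `'nan'`. As a hedged alternative sketch, the same idea can be expressed on the raw (uncast) column with `isnull()`, followed by a single `drop` call. The frames `toy_desc` and `toy_food` and their contents are invented for illustration.

```python
import numpy as np
import pandas as pd

# toy metadata and data frames (names and values are invented)
toy_desc = pd.DataFrame({
    "Item": ["Item A", "Item B", "Item C"],
    "Description": ["Default composition: 1 Foo", np.nan, np.nan],
})
toy_food = pd.DataFrame({"Item A": [1.0, 2.0], "Item B": [3.0, 4.0]})

# items whose description is missing, detected without a string comparison
undescribed = toy_desc.loc[toy_desc["Description"].isnull(), "Item"].tolist()

# drop them in one call; errors="ignore" skips names absent from the data
toy_food = toy_food.drop(undescribed, axis=1, errors="ignore")
print(list(toy_food.columns))
```

Here `Item C` has no description but is not a column of `toy_food`, which is why `errors="ignore"` is needed to avoid a `KeyError`.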

In [34]:
crops_desc[crops_desc["Description"] == 'nan']

Unnamed: 0,Item Code,Item,Description,HS Code,HS07 Code,HS12 Code,CPC Code
0,2924,Alcoholic Beverages,,,,,
9,2905,Cereals - Excluding Beer,,,,,
19,2919,Fruits - Excluding Wine,,,,,
21,2901,Grand Total,,,,,
32,2899,Miscellaneous,,,,,
33,2928,Miscellaneous,,,,,
37,2913,Oilcrops,,,,,
52,2911,Pulses,,,,,
58,2815,Roots & Tuber Dry Equiv,,,,,
66,2923,Spices,,,,,


In [35]:
livestock_desc[livestock_desc["Description"] == 'nan']

Unnamed: 0,Item Code,Item,Description,HS Code,HS07 Code,HS12 Code,CPC Code
0,2946,Animal fats,,,,,
1,2941,Animal Products,,,,,
4,2961,"Aquatic Products, Other",,,,,
8,2741,Cheese,,,,,
13,2949,Eggs,,,,,
18,2960,"Fish, Seafood",,,,,
20,2901,Grand Total,,,,,
24,2943,Meat,,,,,
28,2948,Milk - Excluding Butter,,,,,
30,2738,"Milk, Whole",,,,,


In [36]:
empty_descs = list(livestock_desc[livestock_desc["Description"] == 'nan']['Item'])
empty_descs += list(crops_desc[crops_desc["Description"] == 'nan']['Item'])

In [37]:
# drop the columns
for col in empty_descs:
    if col in food_1970_2000_cleaned.columns:
        food_1970_2000_cleaned = food_1970_2000_cleaned.drop(col, axis = 1)

That looks good! Let's save it now.

In [38]:
# Save this dataframe for later
with open('data/final/food_1970_2000_cleaned.p', 'wb') as f:
    pickle.dump(food_1970_2000_cleaned, f)