# 0. Necessary Import Statements

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# 1. Load the Data

In [2]:
taxonomy_df = pd.read_csv('../data/annotations_tracking_Taxonomy.csv', 
                          usecols = ["ID", "Parent", "Tier1", "Tier2", "Tier3", "Tier4"])

In [3]:
taxonomy_df.head()

Unnamed: 0,ID,Parent,Tier1,Tier2,Tier3,Tier4
0,1.0,,Automotive,,,
1,2.0,1.0,Automotive,Auto Body Styles,,
2,3.0,2.0,Automotive,Auto Body Styles,Commercial Trucks,
3,4.0,2.0,Automotive,Auto Body Styles,Sedan,
4,5.0,2.0,Automotive,Auto Body Styles,Wagon,


In [4]:
taxonomy_df.tail()

Unnamed: 0,ID,Parent,Tier1,Tier2,Tier3,Tier4
1175,1476.0,1228.0,Content Source Geo,Country,Zambia,
1176,1477.0,1228.0,Content Source Geo,Country,Zimbabwe,
1177,1478.0,1219.0,Content Source Geo,Region/State,,
1178,1479.0,1219.0,Content Source Geo,Metro,,
1179,1480.0,1219.0,Content Source Geo,City,,


# 2. Clean up the Data

Let's first take a look at the type of each column. The tier columns are designated as `object` because they hold strings. The `NaN`s scattered through them are expected, since not every class sequence has the same number of tiers. `ID` and `Parent` are read as `float64` rather than integers because they contain `NaN`s as well.

In [5]:
taxonomy_df.dtypes

ID        float64
Parent    float64
Tier1      object
Tier2      object
Tier3      object
Tier4      object
dtype: object
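The `float64` dtype of `ID` and `Parent` is a side effect of the missing values. As a minimal sketch on toy data (the values below are hypothetical, not the real taxonomy), pandas' nullable `Int64` dtype can hold whole-number IDs alongside missing entries:

```python
import numpy as np
import pandas as pd

# Toy data, not the real taxonomy: the last ID is missing.
toy_df = pd.DataFrame({"ID": [1, 2, np.nan]})

# NaN is a float, so the otherwise-integer column is upcast to float64.
print(toy_df["ID"].dtype)

# The nullable Int64 dtype keeps integers and marks the gap as <NA>.
nullable_ids = toy_df["ID"].astype("Int64")
print(nullable_ids.dtype)
```

Whether such a cast is worthwhile here depends on how the IDs are used later; the rest of this notebook works with the float columns as loaded.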

Double-check that there is nothing weird going on with the instances of `NaN`. Doing this will also help give us a sense of, for example, how many class sequences are comprised of 4 tiers.

In [6]:
(taxonomy_df.isna().sum() / len(taxonomy_df))*100

ID         0.084746
Parent     3.050847
Tier1      0.084746
Tier2      3.050847
Tier3     50.508475
Tier4     94.830508
dtype: float64

So we shouldn't be surprised at the large number of `NaN`s in the `Tier3` and `Tier4` columns. BUT it is a bit strange to see some in the `Tier1` and `ID` columns. Let's investigate (especially because this could be totally okay and we just don't understand the way the taxonomy is structured as well as we think we do).

In [7]:
taxonomy_df[taxonomy_df.Tier1.isna()]

Unnamed: 0,ID,Parent,Tier1,Tier2,Tier3,Tier4
698,,,,,,


Ahh so the instance of `Tier1` being `NaN` is simply due to a single row that is entirely composed of `NaN` values. We will want to drop this row.

In [8]:
taxonomy_df.drop(index = 698).isna().sum()

ID           0
Parent      35
Tier1        0
Tier2       35
Tier3      595
Tier4     1118
dtype: int64

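This fixes this specific issue! Note that `drop` returns a new DataFrame, so `taxonomy_df` itself is still unchanged here. **The actual drop is done below.**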

Now let's take a look at why the `Parent` column might have some `NaN` values.

In [9]:
taxonomy_df[taxonomy_df.Parent.isna()]

Unnamed: 0,ID,Parent,Tier1,Tier2,Tier3,Tier4
0,1.0,,Automotive,,,
41,42.0,,Books and Literature,,,
51,52.0,,Business and Finance,,,
122,123.0,,Careers,,,
131,132.0,,Education,,,
149,150.0,,Events and Attractions,,,
185,186.0,,Family and Relationships,,,
200,201.0,,Fine Art,,,
209,210.0,,Food & Drink,,,
222,223.0,,Healthy Living,,,


In [10]:
(taxonomy_df[taxonomy_df.Tier2.isna()].ID == taxonomy_df[taxonomy_df.Parent.isna()].ID).sum()

35

Evidently, these are all of the parent classes in this taxonomy. By parent classes, we mean the top-level classes whose only parent is the implicit, most generic label to which every instance belongs. Thus, the presence of these `NaN`s is totally expected and they will be **left alone**. We also see that the rows in which `Tier2` is `NaN` are exactly these same rows! This is good behavior that we can be quite happy with.
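The same consistency check can be written by comparing the two `NaN` masks row by row. This is only a sketch on a toy frame (hypothetical rows modeled on the table above), not a replacement for the check on the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for taxonomy_df: rows 0 and 3 are top-level classes.
toy_df = pd.DataFrame({
    "ID": [1.0, 2.0, 3.0, 42.0],
    "Parent": [np.nan, 1.0, 2.0, np.nan],
    "Tier1": ["Automotive", "Automotive", "Automotive", "Books and Literature"],
    "Tier2": [np.nan, "Auto Body Styles", "Auto Body Styles", np.nan],
})

parent_missing = toy_df["Parent"].isna()
tier2_missing = toy_df["Tier2"].isna()

# True only if a row lacks a Parent exactly when it also lacks a Tier2.
masks_agree = bool((parent_missing == tier2_missing).all())
print(masks_agree, int(parent_missing.sum()))
```

Comparing boolean masks avoids relying on `NaN == NaN` being `False`, which is what keeps a fully blank row out of an ID-based comparison.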

The only thing left to investigate is why some instances have a `NaN` in the `ID` column.

In [11]:
taxonomy_df[taxonomy_df.ID.isna()]

Unnamed: 0,ID,Parent,Tier1,Tier2,Tier3,Tier4
698,,,,,,


Not surprisingly, this is occurring in that completely blank row that was identified above. Thus, we have even more incentive to drop it.

In [12]:
cleaned_taxonomy_df = taxonomy_df.drop(index = 698)
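The drop above hardcodes index 698. A sketch of an alternative that does not depend on the row's position, shown on a toy frame with hypothetical values, is `dropna(how="all")`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for taxonomy_df with one fully blank row at index 1.
toy_df = pd.DataFrame({
    "ID": [1.0, np.nan, 2.0],
    "Parent": [np.nan, np.nan, 1.0],
    "Tier1": ["Automotive", np.nan, "Automotive"],
    "Tier2": [np.nan, np.nan, "Auto Body Styles"],
})

# how="all" removes only rows in which every column is NaN, so the
# top-level row (NaN in Parent and Tier2 only) is preserved.
cleaned_toy_df = toy_df.dropna(how="all")
print(cleaned_toy_df.index.tolist())
```

If the source CSV were regenerated and the blank row moved, this form would still remove it, whereas a fixed index could drop a valid row instead.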

In [13]:
cleaned_taxonomy_df

Unnamed: 0,ID,Parent,Tier1,Tier2,Tier3,Tier4
0,1.0,,Automotive,,,
1,2.0,1.0,Automotive,Auto Body Styles,,
2,3.0,2.0,Automotive,Auto Body Styles,Commercial Trucks,
3,4.0,2.0,Automotive,Auto Body Styles,Sedan,
4,5.0,2.0,Automotive,Auto Body Styles,Wagon,
...,...,...,...,...,...,...
1175,1476.0,1228.0,Content Source Geo,Country,Zambia,
1176,1477.0,1228.0,Content Source Geo,Country,Zimbabwe,
1177,1478.0,1219.0,Content Source Geo,Region/State,,
1178,1479.0,1219.0,Content Source Geo,Metro,,


In [14]:
cleaned_taxonomy_df.isna().sum()

ID           0
Parent      35
Tier1        0
Tier2       35
Tier3      595
Tier4     1118
dtype: int64

This cleaned DataFrame is structured perfectly well!

# 3. Analyze the Data

#### 3.1 Let's get a count of each unique class sequence

Note that this will also help us verify that each class sequence (row) that we have is unique.
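One direct way to test uniqueness, sketched here on a toy frame with a deliberately repeated row (hypothetical data), is `DataFrame.duplicated` restricted to the tier columns:

```python
import pandas as pd

tier_columns = ["Tier1", "Tier2", "Tier3", "Tier4"]

# Toy data: the last two rows describe the same class sequence.
toy_df = pd.DataFrame(
    [
        ["Automotive", None, None, None],
        ["Automotive", "Auto Body Styles", None, None],
        ["Automotive", "Auto Body Styles", "Sedan", None],
        ["Automotive", "Auto Body Styles", "Sedan", None],
    ],
    columns=tier_columns,
)

# keep=False flags every member of a duplicated group, not just the repeats.
duplicate_mask = toy_df.duplicated(subset=tier_columns, keep=False)
print(toy_df[duplicate_mask])
```

An empty result on the real DataFrame would indicate that every row is a distinct class sequence.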

In [15]:
parent_classes_list = cleaned_taxonomy_df[cleaned_taxonomy_df.Parent.isna()].Tier1.tolist()
parent_classes_list

['Automotive',
 'Books and Literature',
 'Business and Finance',
 'Careers',
 'Education',
 'Events and Attractions',
 'Family and Relationships',
 'Fine Art',
 'Food & Drink',
 'Healthy Living',
 'Hobbies & Interests',
 'Home & Garden',
 'Medical Health',
 'Movies',
 'Music and Audio',
 'News and Politics',
 'Personal Finance',
 'Pets',
 'Pop Culture',
 'Real Estate',
 'Religion & Spirituality',
 'Science',
 'Shopping',
 'Sports',
 'Style & Fashion',
 'Technology & Computing',
 'Television',
 'Travel',
 'Video Gaming',
 'Content Channel',
 'Content Type',
 'Content Media Format',
 'Content Language',
 'Content Source',
 'Content Source Geo']

In [16]:
def class_sequence_compiler(row):
    """
    Build the list of tier labels for a single taxonomy row, from Tier1
    down to the deepest tier that is filled in. Returns NaN for rows that
    correspond to a parent node (only Tier1 is filled in).
    """
    to_return = []
    ### Compile all of the Tier labels.
    tier_1_str, tier_2, tier_3, tier_4 = row.Tier1, row.Tier2, row.Tier3, row.Tier4
    # Just to make sure nothing weird is going on as we extract the labels.
    assert isinstance(tier_1_str, str)
    to_return.append(tier_1_str)
    
    ### Now determine how many tiers should be appended.
    tier_2_str, tier_3_str, tier_4_str = str(tier_2), str(tier_3), str(tier_4)
    nan_for_all_conditions = [tier_2_str.lower() == 'nan',
                              tier_3_str.lower() == 'nan',
                              tier_4_str.lower() == 'nan']
    if np.all(nan_for_all_conditions):
        # We are working with a row that corresponds to a parent node.
        # In this case we don't want to do any compilation since
        # it would be counter-productive.
        to_return = np.nan  # np.NaN was removed in NumPy 2.0
    elif np.all(nan_for_all_conditions[1:]):
        # Both tier 3 and tier 4 are NaN.
        to_return.append(tier_2_str)
    elif np.all(nan_for_all_conditions[2:]):
        # Only tier 4 has a NaN value.
        to_return.append(tier_2_str)
        to_return.append(tier_3_str)
    else:
        # One of the rare cases in which all 4 tiers do NOT have
        # NaN values.
        to_return.append(tier_2_str)
        to_return.append(tier_3_str)
        to_return.append(tier_4_str)
    
    return to_return
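A more compact variant of the same idea is sketched below. It is not the function used in this notebook: `compile_sequence` and the toy frame are hypothetical, and the sketch assumes tiers are always filled from left to right, so dropping the `NaN`s leaves exactly the labels up to the deepest tier.

```python
import numpy as np
import pandas as pd

tier_columns = ["Tier1", "Tier2", "Tier3", "Tier4"]

def compile_sequence(row):
    """Return the non-missing tier labels of a row, or NaN for a parent node."""
    labels = row[tier_columns].dropna().tolist()
    # A row with only Tier1 filled in is a parent node: no sequence is built.
    return labels if len(labels) > 1 else np.nan

# Toy data modeled on the first rows of the taxonomy.
toy_df = pd.DataFrame(
    [
        ["Automotive", np.nan, np.nan, np.nan],
        ["Automotive", "Auto Body Styles", np.nan, np.nan],
        ["Automotive", "Auto Body Styles", "Sedan", np.nan],
    ],
    columns=tier_columns,
)

sequences = toy_df.apply(compile_sequence, axis=1)
print(sequences.tolist())
```

The two versions differ on malformed rows: if a middle tier were missing, this sketch would silently skip it, while the string-based version would append the literal `'nan'`.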

In [17]:
sequence_of_classes_series = cleaned_taxonomy_df.apply(class_sequence_compiler, axis = 1).dropna()

In [18]:
parent_class_counts_list = [len([seq for seq in sequence_of_classes_series.tolist() if seq[0] == parent_class]) for parent_class in parent_classes_list]
print(parent_class_counts_list)

[40, 9, 70, 8, 17, 35, 14, 8, 12, 15, 34, 11, 37, 13, 40, 11, 30, 9, 8, 11, 10, 8, 9, 68, 43, 43, 12, 26, 18, 9, 11, 7, 184, 3, 261]


In [19]:
dict(zip(parent_classes_list, parent_class_counts_list))

{'Automotive': 40,
 'Books and Literature': 9,
 'Business and Finance': 70,
 'Careers': 8,
 'Education': 17,
 'Events and Attractions': 35,
 'Family and Relationships': 14,
 'Fine Art': 8,
 'Food & Drink': 12,
 'Healthy Living': 15,
 'Hobbies & Interests': 34,
 'Home & Garden': 11,
 'Medical Health': 37,
 'Movies': 13,
 'Music and Audio': 40,
 'News and Politics': 11,
 'Personal Finance': 30,
 'Pets': 9,
 'Pop Culture': 8,
 'Real Estate': 11,
 'Religion & Spirituality': 10,
 'Science': 8,
 'Shopping': 9,
 'Sports': 68,
 'Style & Fashion': 43,
 'Technology & Computing': 43,
 'Television': 12,
 'Travel': 26,
 'Video Gaming': 18,
 'Content Channel': 9,
 'Content Type': 11,
 'Content Media Format': 7,
 'Content Language': 184,
 'Content Source': 3,
 'Content Source Geo': 261}

Evidently, the number of nodes for each parent class is fairly similar, with all but 2 staying at or below 70. `Content Language` (184) and `Content Source Geo` (261) are the two outlier cases, since each has a significantly larger number of nodes than the rest. We'll have to keep this in mind as we're building our model.
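The same per-class counts can also be obtained without building the sequences first. As a sketch on toy data (hypothetical rows), filter out the parent rows and count the remaining nodes by `Tier1`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for cleaned_taxonomy_df: rows 0 and 3 are parent classes.
toy_df = pd.DataFrame({
    "Parent": [np.nan, 1.0, 2.0, np.nan, 42.0],
    "Tier1": ["Automotive", "Automotive", "Automotive",
              "Books and Literature", "Books and Literature"],
})

# Exclude parent rows (no Parent), then count the remaining nodes per Tier1.
non_root = toy_df[toy_df["Parent"].notna()]
node_counts = non_root["Tier1"].value_counts()
print(node_counts.to_dict())
```

This relies on the earlier finding that a missing `Parent` marks exactly the parent classes, which are the rows the sequence compiler maps to `NaN`.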

In [21]:
cleaned_taxonomy_df["Hi"] = sequence_of_classes_series

In [22]:
cleaned_taxonomy_df

Unnamed: 0,ID,Parent,Tier1,Tier2,Tier3,Tier4,Hi
0,1.0,,Automotive,,,,
1,2.0,1.0,Automotive,Auto Body Styles,,,"[Automotive, Auto Body Styles]"
2,3.0,2.0,Automotive,Auto Body Styles,Commercial Trucks,,"[Automotive, Auto Body Styles, Commercial Trucks]"
3,4.0,2.0,Automotive,Auto Body Styles,Sedan,,"[Automotive, Auto Body Styles, Sedan]"
4,5.0,2.0,Automotive,Auto Body Styles,Wagon,,"[Automotive, Auto Body Styles, Wagon]"
...,...,...,...,...,...,...,...
1175,1476.0,1228.0,Content Source Geo,Country,Zambia,,"[Content Source Geo, Country, Zambia]"
1176,1477.0,1228.0,Content Source Geo,Country,Zimbabwe,,"[Content Source Geo, Country, Zimbabwe]"
1177,1478.0,1219.0,Content Source Geo,Region/State,,,"[Content Source Geo, Region/State]"
1178,1479.0,1219.0,Content Source Geo,Metro,,,"[Content Source Geo, Metro]"


In [23]:
cleaned_taxonomy_df.apply(class_sequence_compiler, axis = 1)

0                                                     NaN
1                          [Automotive, Auto Body Styles]
2       [Automotive, Auto Body Styles, Commercial Trucks]
3                   [Automotive, Auto Body Styles, Sedan]
4                   [Automotive, Auto Body Styles, Wagon]
                              ...                        
1175                [Content Source Geo, Country, Zambia]
1176              [Content Source Geo, Country, Zimbabwe]
1177                   [Content Source Geo, Region/State]
1178                          [Content Source Geo, Metro]
1179                           [Content Source Geo, City]
Length: 1179, dtype: object

In [24]:
my_series = _
