<a href="https://colab.research.google.com/github/kleczekr/dtc/blob/main/dtc_tags_categories.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fixing the category issue

The fact that articles in the Days to Come magazine predominantly have multiple categories is a bit of a bother. Here I present a simplistic way of selecting the most popular category from the list of categories for each article.

In [None]:
import pandas as pd
from collections import Counter
import operator

In [None]:
df = pd.read_csv('path_to_file', 
                 converters={'categories': eval, 
                             'tags': eval})
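Since `eval` runs whatever Python expression it finds in a cell, `ast.literal_eval` may be a safer converter here, as it only parses literals such as lists and strings. Below is a minimal sketch, assuming the two columns hold stringified Python lists. The inline CSV and the `sample_csv` and `sample_df` names are invented for illustration and stand in for the real file.

```python
import ast
import io

import pandas as pd

# Invented sample data standing in for the real CSV file.
sample_csv = io.StringIO('''title,categories,tags
First post,"['Europe', 'Destinations']","['city']"
Second post,"['Lifestyle']","[]"
''')

# literal_eval turns "['Europe', 'Destinations']" into a real list
# without executing arbitrary code.
sample_df = pd.read_csv(
    sample_csv,
    converters={'categories': ast.literal_eval, 'tags': ast.literal_eval},
)

print(sample_df.categories.map(len).tolist())
```

A cell that is not a valid Python literal makes `ast.literal_eval` raise an error instead of being executed, which also flags malformed rows early.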

In [None]:
# let's count how long the category lists in the articles are
df.categories.map(len).value_counts()

2    9943
1    9211
3    1618
4      93
7      28
Name: categories, dtype: int64

Two categories per article is the most common case, though nearly as many articles have just one, and in a few the count goes up to seven. To divide the articles into neat groups, I need one category per article. To keep the groups as comprehensive as possible, I'd like to pick the most popular category for each article. The first step towards that is simply counting how often each category occurs.

In [None]:
cats = df.categories

In [None]:
catcount = Counter()
for list_ in cats:
  for item in list_:
    catcount[item] += 1

In [None]:
catcount

Counter({'Africa': 1028,
         'Asia': 3600,
         'Central America': 22,
         'Culture': 883,
         'Destination Guide': 79,
         'Destinations': 3743,
         'Europe': 4147,
         'Food & Drink': 1551,
         'Lifestyle': 5110,
         'Local Travel': 92,
         'North America': 1748,
         'Oceania': 592,
         'Photography': 659,
         'South America': 926,
         'Stories': 642,
         'Sustainable Travel': 387,
         'Tips & Tricks': 5196,
         'Tour Operators': 428,
         'Virtual Travel': 98,
         'Worldly Insights': 3586,
         'no categories': 2})
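The raw `Counter` output is sorted alphabetically, which makes the ranking hard to read. As a small sketch on invented data (`sample_cats` stands in for `df.categories`), the same count can be built in one pass with `itertools.chain.from_iterable` and then listed by popularity with `Counter.most_common`:

```python
from collections import Counter
from itertools import chain

# Invented sample standing in for df.categories.
sample_cats = [
    ['Europe', 'Destinations'],
    ['Lifestyle'],
    ['Europe', 'Lifestyle', 'Food & Drink'],
    ['Lifestyle', 'Tips & Tricks'],
]

# Flatten the lists and count every category in one pass.
sample_count = Counter(chain.from_iterable(sample_cats))

# most_common() returns (category, count) pairs, highest count first.
print(sample_count.most_common(2))
# [('Lifestyle', 3), ('Europe', 2)]
```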

The next step is selecting the most popular category in each list of categories. I was tempted to use an if... elif... else chain for that, but it would be rather error-prone as the number of categories per article grows. Eventually I opted for a simpler solution: look up the overall count of every category in the list and keep the one with the highest count.

In [None]:
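overarching_cat = []
for list_ in cats:
  # overall counts of this article's categories, in list order
  temp = []
  for item in list_:
    temp.append(catcount[item])
  # position of the highest count (the first one wins on a tie)
  max_index = temp.index(max(temp))
  overarching_cat.append(list_[max_index])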
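The same selection can probably be written more compactly with `max` and a `key` function. This is only a sketch on invented data: `sample_count` and `sample_cats` stand in for `catcount` and `cats`. On a tie `max` returns the first of the tied items, which matches the behaviour of `temp.index(max(temp))`.

```python
from collections import Counter

# Invented counts and category lists, standing in for catcount and cats.
sample_count = Counter(
    {'Lifestyle': 3, 'Europe': 2, 'Destinations': 1, 'Food & Drink': 1}
)
sample_cats = [
    ['Europe', 'Destinations'],
    ['Destinations', 'Food & Drink'],  # tie: both counts are 1
    ['Food & Drink', 'Lifestyle', 'Europe'],
]

# For each article, keep the category with the highest overall count.
sample_overarching = [
    max(list_, key=lambda cat: sample_count[cat]) for list_ in sample_cats
]
print(sample_overarching)
# ['Europe', 'Destinations', 'Lifestyle']
```

Both versions raise a `ValueError` on an empty category list, so they rely on every article having at least one entry, as the length counts above suggest.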

In [None]:
# what is left to be done is simply inserting the created list
# as a separate column in the dataframe
df.insert(5, 'overarching_category', overarching_cat)
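Note that `df.insert(5, ...)` only works if the dataframe already has at least five columns. A sketch of an alternative that does not depend on the column count, followed by a quick sanity check, might look like this. The `sample_df` frame and `sample_overarching` list are invented for illustration.

```python
import pandas as pd

# Invented sample standing in for the real dataframe and the computed list.
sample_df = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'categories': [['Europe', 'Destinations'], ['Lifestyle'], ['Lifestyle', 'Europe']],
})
sample_overarching = ['Europe', 'Lifestyle', 'Lifestyle']

# Assigning by name appends the column at the end, whatever the frame width.
sample_df['overarching_category'] = sample_overarching

# Sanity check: each chosen category should come from the article's own list.
all_valid = all(
    chosen in cats_
    for chosen, cats_ in zip(sample_df.overarching_category, sample_df.categories)
)
print(all_valid)

# Group sizes after reducing every article to a single category.
print(sample_df.overarching_category.value_counts().to_dict())
```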