# Categorical Data Preparation
Most machine learning algorithms cannot work with categorical variables directly. We therefore need to convert the categorical variables into numeric values before we can continue working with them. This notebook covers that preparation step.

In [114]:
import pandas as pd
from feature_engine.encoding import RareLabelEncoder, OneHotEncoder, CountFrequencyEncoder

## Loading Data

In [115]:
X_train = pd.read_csv("Xtrain_mod.csv")
X_test = pd.read_csv("Xtest_mod.csv")
ytrain = pd.read_csv("ytrain.csv")
ytest = pd.read_csv("ytest.csv")
print("Shape of X Train: {}".format(X_train.shape))
print("Shape of X Test: {}".format(X_test.shape))
print("Shape of y Train: {}".format(ytrain.shape))
print("Shape of y Test: {}".format(ytest.shape))

Shape of X Train: (8672, 9)
Shape of X Test: (2168, 9)
Shape of y Train: (8672, 1)
Shape of y Test: (2168, 1)


In [116]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8672 entries, 0 to 8671
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Category           8672 non-null   object 
 1   Rating             8672 non-null   float64
 2   Reviews            8672 non-null   int64  
 3   Size               8672 non-null   object 
 4   Type               8672 non-null   object 
 5   Price              8672 non-null   float64
 6   Content Rating     8672 non-null   object 
 7   Genres             8672 non-null   object 
 8   days_since_update  8672 non-null   int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 609.9+ KB


## Categorical Variables
Next, we take a closer look at the categorical variables.

In [117]:
X_train_cat = X_train.select_dtypes(include=['object'])

In [118]:
for col in X_train_cat:
    # number of distinct categories in this column
    cardinality = X_train_cat[col].nunique()
    print(col + ": " + str(cardinality))

Category: 33
Size: 11
Type: 2
Content Rating: 6
Genres: 119
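The same cardinality check can be sketched on a toy frame without a loop. The data below is made up for illustration and only stands in for `X_train`; `exclude=["number"]` is used here as an assumption-free way to pick the non-numeric columns regardless of how a given pandas version stores strings.

```python
import pandas as pd

# Toy frame standing in for X_train (values are invented for illustration)
df = pd.DataFrame({
    "Type": ["Free", "Paid", "Free", "Free"],
    "Genres": ["Tools", "Action", "Puzzle", "Tools"],
    "Rating": [4.1, 3.9, 4.5, 4.0],
})

# Keep only the non-numeric columns, then count distinct non-null values per column
cardinality = df.select_dtypes(exclude=["number"]).nunique()
print(cardinality)
```

A high count, such as the 119 distinct values of `Genres` above, is a hint that one-hot encoding without prior grouping would add a lot of columns.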


## Rare Label Encoder
As a first step, we group the rare categories.

In [119]:
rare_encoder = RareLabelEncoder(
    # minimum frequency a label needs to be considered frequent;
    # categories with a frequency below tol are grouped together
    tol=0.01,
    # minimum number of categories a variable needs for the encoder to look for
    # frequent labels; with fewer categories, all of them are considered frequent
    n_categories=4,
)
rare_encoder.fit(X_train)



In [120]:
rare_encoder.variables_

['Category', 'Size', 'Type', 'Content Rating', 'Genres']

In [121]:
# encoder_dict_ maps each variable to its frequent labels
rare_encoder.encoder_dict_

{'Category': Index(['FAMILY', 'GAME', 'TOOLS', 'MEDICAL', 'BUSINESS', 'PRODUCTIVITY',
        'PERSONALIZATION', 'COMMUNICATION', 'LIFESTYLE', 'SPORTS',
        'HEALTH_AND_FITNESS', 'FINANCE', 'PHOTOGRAPHY', 'SOCIAL',
        'NEWS_AND_MAGAZINES', 'SHOPPING', 'TRAVEL_AND_LOCAL', 'DATING',
        'BOOKS_AND_REFERENCE', 'EDUCATION', 'ENTERTAINMENT', 'VIDEO_PLAYERS',
        'MAPS_AND_NAVIGATION', 'FOOD_AND_DRINK'],
       dtype='object'),
 'Size': Index(['0.1-10MB', 'Varies with device', '10.1-20MB', '20.1-30MB', '30.1-40MB',
        '40.1-50MB', '50.1-60MB', '60.1-70MB', '90.1-100MB', '70.1-80MB',
        '80.1-90MB'],
       dtype='object'),
 'Type': array(['Free', 'Paid'], dtype=object),
 'Content Rating': Index(['Everyone', 'Teen', 'Mature 17+', 'Everyone 10+'], dtype='object'),
 'Genres': Index(['Tools', 'Entertainment', 'Education', 'Medical', 'Business',
        'Productivity', 'Personalization', 'Communication', 'Sports',
        'Lifestyle', 'Action', 'Health & Fitness', 'Fina

In [122]:
X_train_t = rare_encoder.transform(X_train)
X_test_t = rare_encoder.transform(X_test)

In [123]:
X_train_t_cat = X_train_t.select_dtypes(include=['object'])
for col in X_train_t_cat:
    # number of distinct categories left after grouping rare labels
    cardinality = X_train_t_cat[col].nunique()
    print(col + ": " + str(cardinality))

Category: 25
Size: 11
Type: 2
Content Rating: 5
Genres: 29
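To make the grouping step concrete, here is a minimal pandas-only sketch of the idea behind rare label encoding. It is not the `feature_engine` implementation: the helper `group_rare_labels` is hypothetical, and the exact comparison (`>=` against `tol`, `<=` against `n_categories`) as well as the handling of unseen test labels are assumptions made for this illustration.

```python
import pandas as pd

def group_rare_labels(train, test, tol=0.01, n_categories=4, replace_with="Rare"):
    """Learn frequent labels on train, then relabel everything else in both series.

    Hypothetical helper for illustration only.
    """
    if train.nunique() <= n_categories:
        # Too few categories: treat all of them as frequent and change nothing
        return train.copy(), test.copy()
    freq = train.value_counts(normalize=True)      # relative frequency per label
    frequent = set(freq[freq >= tol].index)        # labels that pass the threshold

    def relabel(s):
        # Keep frequent labels, replace everything else (including unseen labels)
        return s.where(s.isin(frequent), replace_with)

    return relabel(train), relabel(test)

# Toy data: A=60%, B=30%, C=6%, D=2%, E=2%
train = pd.Series(["A"] * 60 + ["B"] * 30 + ["C"] * 6 + ["D"] * 2 + ["E"] * 2)
test = pd.Series(["A", "C", "E", "Z"])

train_t, test_t = group_rare_labels(train, test, tol=0.05)
print(sorted(train_t.unique()))
print(test_t.tolist())
```

The frequent labels are learned on the training data only and then applied to both sets, which mirrors the fit-on-train, transform-both pattern used in the cells above.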


## One-Hot Encoding
In this step we apply one-hot encoding.

### k One-hot Encoding

In [124]:
# set up encoder

encoder = OneHotEncoder(
    variables=None,  # alternatively pass a list of variables
    top_categories=None,
    drop_last=False,  # False returns k dummies; set to True to return k-1
)

In [125]:
# fit the encoder (finds categories)

encoder.fit(X_train_t)

In [126]:
# categorical variables found automatically

encoder.variables_

['Category', 'Size', 'Type', 'Content Rating', 'Genres']

In [127]:
# we observe the learned categories

encoder.encoder_dict_

{'Category': ['TRAVEL_AND_LOCAL',
  'VIDEO_PLAYERS',
  'FINANCE',
  'FAMILY',
  'MEDICAL',
  'GAME',
  'SOCIAL',
  'PERSONALIZATION',
  'PHOTOGRAPHY',
  'MAPS_AND_NAVIGATION',
  'HEALTH_AND_FITNESS',
  'COMMUNICATION',
  'Rare',
  'DATING',
  'TOOLS',
  'LIFESTYLE',
  'BUSINESS',
  'PRODUCTIVITY',
  'BOOKS_AND_REFERENCE',
  'SHOPPING',
  'EDUCATION',
  'NEWS_AND_MAGAZINES',
  'SPORTS',
  'FOOD_AND_DRINK',
  'ENTERTAINMENT'],
 'Size': ['Varies with device',
  '0.1-10MB',
  '40.1-50MB',
  '10.1-20MB',
  '90.1-100MB',
  '50.1-60MB',
  '80.1-90MB',
  '30.1-40MB',
  '60.1-70MB',
  '20.1-30MB',
  '70.1-80MB'],
 'Type': ['Free', 'Paid'],
 'Content Rating': ['Everyone', 'Teen', 'Everyone 10+', 'Mature 17+', 'Rare'],
 'Genres': ['Travel & Local',
  'Video Players & Editors',
  'Finance',
  'Entertainment',
  'Role Playing',
  'Rare',
  'Medical',
  'Arcade',
  'Education',
  'Social',
  'Personalization',
  'Photography',
  'Maps & Navigation',
  'Health & Fitness',
  'Communication',
  'Dating

In [128]:
# transform the data sets

X_train_enc = encoder.transform(X_train_t)
X_test_enc = encoder.transform(X_test_t)

X_train_enc.head()

Unnamed: 0,Rating,Reviews,Price,days_since_update,Category_TRAVEL_AND_LOCAL,Category_VIDEO_PLAYERS,Category_FINANCE,Category_FAMILY,Category_MEDICAL,Category_GAME,...,Genres_Productivity,Genres_Books & Reference,Genres_Shopping,Genres_Puzzle,Genres_Casual,Genres_News & Magazines,Genres_Sports,Genres_Action,Genres_Simulation,Genres_Food & Drink
0,4.4,17915,0.0,44,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3.7,321,0.0,535,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,5.0,12,0.0,1821,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4.7,303,0.0,40,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4.5,80904,0.0,57,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's check how large the dataset has become.

In [129]:
print("Dimensions of the original dataset: {}".format(X_train_t.shape))
print("Dimensions of the encoded dataset: {}".format(X_train_enc.shape))

Dimensions of the original dataset: (8672, 9)
Dimensions of the encoded dataset: (8672, 76)
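The column count of 76 can be reproduced by hand from the cardinalities printed after the rare label step. The short check below is only arithmetic on those numbers; the variable names are chosen for this illustration.

```python
# Cardinalities after rare-label grouping, as printed earlier in this notebook
cardinality = {"Category": 25, "Size": 11, "Type": 2, "Content Rating": 5, "Genres": 29}

n_total_columns = 9                                  # columns before encoding
n_numeric = n_total_columns - len(cardinality)       # Rating, Reviews, Price, days_since_update

# k encoding: every categorical column is replaced by one dummy per category
k_columns = n_numeric + sum(cardinality.values())
print(k_columns)
```

With 4 numeric columns and 25 + 11 + 2 + 5 + 29 = 72 dummies, this gives the 76 columns reported above.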


Let's save the encoded data.

In [130]:
X_train_enc.to_csv("Xtrain_k_one_hot.csv",index=False)
X_test_enc.to_csv("Xtest_k_one_hot.csv",index=False)

### k-1 One-hot Encoding

In [131]:
# set up encoder

encoder = OneHotEncoder(
    variables=None,  # alternatively pass a list of variables
    top_categories=None,
    drop_last=True,  # True returns k-1 dummies; set to False to return k
)
# fit the encoder (finds categories)

encoder.fit(X_train_t)


In [132]:
# categorical variables found automatically

encoder.variables_

['Category', 'Size', 'Type', 'Content Rating', 'Genres']

In [133]:
# we observe the learned categories

encoder.encoder_dict_

{'Category': ['TRAVEL_AND_LOCAL',
  'VIDEO_PLAYERS',
  'FINANCE',
  'FAMILY',
  'MEDICAL',
  'GAME',
  'SOCIAL',
  'PERSONALIZATION',
  'PHOTOGRAPHY',
  'MAPS_AND_NAVIGATION',
  'HEALTH_AND_FITNESS',
  'COMMUNICATION',
  'Rare',
  'DATING',
  'TOOLS',
  'LIFESTYLE',
  'BUSINESS',
  'PRODUCTIVITY',
  'BOOKS_AND_REFERENCE',
  'SHOPPING',
  'EDUCATION',
  'NEWS_AND_MAGAZINES',
  'SPORTS',
  'FOOD_AND_DRINK'],
 'Size': ['Varies with device',
  '0.1-10MB',
  '40.1-50MB',
  '10.1-20MB',
  '90.1-100MB',
  '50.1-60MB',
  '80.1-90MB',
  '30.1-40MB',
  '60.1-70MB',
  '20.1-30MB'],
 'Type': ['Free'],
 'Content Rating': ['Everyone', 'Teen', 'Everyone 10+', 'Mature 17+'],
 'Genres': ['Travel & Local',
  'Video Players & Editors',
  'Finance',
  'Entertainment',
  'Role Playing',
  'Rare',
  'Medical',
  'Arcade',
  'Education',
  'Social',
  'Personalization',
  'Photography',
  'Maps & Navigation',
  'Health & Fitness',
  'Communication',
  'Dating',
  'Tools',
  'Lifestyle',
  'Business',
  'Prod

In [134]:
# transform the data sets

X_train_enc = encoder.transform(X_train_t)
X_test_enc = encoder.transform(X_test_t)

X_train_enc.head()

Unnamed: 0,Rating,Reviews,Price,days_since_update,Category_TRAVEL_AND_LOCAL,Category_VIDEO_PLAYERS,Category_FINANCE,Category_FAMILY,Category_MEDICAL,Category_GAME,...,Genres_Business,Genres_Productivity,Genres_Books & Reference,Genres_Shopping,Genres_Puzzle,Genres_Casual,Genres_News & Magazines,Genres_Sports,Genres_Action,Genres_Simulation
0,4.4,17915,0.0,44,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,3.7,321,0.0,535,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,5.0,12,0.0,1821,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4.7,303,0.0,40,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4.5,80904,0.0,57,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now let's look at the dimensions of the dataset.

In [135]:
print("Dimensions of the original dataset: {}".format(X_train_t.shape))
print("Dimensions of the encoded dataset: {}".format(X_train_enc.shape))

Dimensions of the original dataset: (8672, 9)
Dimensions of the encoded dataset: (8672, 71)
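The difference between k and k-1 dummies can also be sketched with plain pandas on a toy frame. This is an illustration, not the encoder used above: `pd.get_dummies(..., drop_first=True)` drops the first category of each variable, whereas `drop_last=True` in this notebook drops the last one. The resulting column counts are the same either way.

```python
import pandas as pd

# Toy frame with two categorical columns (2 and 3 categories; values are invented)
toy = pd.DataFrame({"Type": ["Free", "Paid", "Free"], "Size": ["S", "M", "L"]})

k_dummies = pd.get_dummies(toy, dtype=int)                    # 2 + 3 = 5 columns
k_minus_1 = pd.get_dummies(toy, drop_first=True, dtype=int)   # 1 + 2 = 3 columns

print(k_dummies.shape, k_minus_1.shape)

# Same reasoning for this notebook: one dummy less for each of the 5 categorical variables
n_categorical = 5
expected_k_minus_1_columns = 76 - n_categorical
print(expected_k_minus_1_columns)
```

Dropping one dummy per variable loses no information, because the dropped category is implied when all remaining dummies of that variable are 0. This is usually preferred for linear models, where the full set of k dummies is perfectly collinear.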


Now let's save the data.

In [136]:
# X_train_enc.to_csv("Xtrain_booking_k_1_one_hot.csv",index=False)
# X_test_enc.to_csv("Xtest_booking_k_1_one_hot.csv",index=False)

## Count-Frequency Encoding

In [137]:
count_enc = CountFrequencyEncoder(
    encoding_method="count",  # use encoding_method='frequency' for relative frequencies
    variables=None
)

count_enc.fit(X_train_t)

In [138]:
# in the encoder dict we can observe the number of
# observations per category for each variable

count_enc.encoder_dict_

{'Category': {'FAMILY': 1583,
  'GAME': 911,
  'TOOLS': 692,
  'Rare': 531,
  'MEDICAL': 367,
  'BUSINESS': 364,
  'PRODUCTIVITY': 333,
  'PERSONALIZATION': 321,
  'COMMUNICATION': 317,
  'LIFESTYLE': 299,
  'SPORTS': 289,
  'HEALTH_AND_FITNESS': 278,
  'FINANCE': 274,
  'PHOTOGRAPHY': 266,
  'SOCIAL': 231,
  'NEWS_AND_MAGAZINES': 227,
  'SHOPPING': 211,
  'TRAVEL_AND_LOCAL': 209,
  'DATING': 188,
  'BOOKS_AND_REFERENCE': 177,
  'EDUCATION': 135,
  'ENTERTAINMENT': 134,
  'VIDEO_PLAYERS': 129,
  'MAPS_AND_NAVIGATION': 109,
  'FOOD_AND_DRINK': 97},
 'Size': {'0.1-10MB': 3253,
  'Varies with device': 1368,
  '10.1-20MB': 1359,
  '20.1-30MB': 949,
  '30.1-40MB': 513,
  '40.1-50MB': 394,
  '50.1-60MB': 263,
  '60.1-70MB': 185,
  '90.1-100MB': 163,
  '70.1-80MB': 130,
  '80.1-90MB': 95},
 'Type': {'Free': 8020, 'Paid': 652},
 'Content Rating': {'Everyone': 6954,
  'Teen': 977,
  'Mature 17+': 398,
  'Everyone 10+': 339,
  'Rare': 4},
 'Genres': {'Rare': 1340,
  'Tools': 691,
  'Entertainmen

In [139]:
# transform the same rare-grouped frames the encoder was fitted on
X_train_enc = count_enc.transform(X_train_t)
X_test_enc = count_enc.transform(X_test_t)

# let's explore the result
X_train_enc.head()



Unnamed: 0,Category,Rating,Reviews,Size,Type,Price,Content Rating,Genres,days_since_update
0,209.0,4.4,17915,1368,8020,0.0,6954.0,208.0,44
1,129.0,3.7,321,3253,8020,0.0,6954.0,127.0,535
2,274.0,5.0,12,3253,8020,0.0,6954.0,274.0,1821
3,1583.0,4.7,303,394,8020,0.0,6954.0,517.0,40
4,274.0,4.5,80904,1359,8020,0.0,6954.0,274.0,57


Now let's look at the dimensions of the dataset.

In [140]:
print("Dimensions of the original dataset: {}".format(X_train_t.shape))
print("Dimensions of the encoded dataset: {}".format(X_train_enc.shape))

Dimensions of the original dataset: (8672, 9)
Dimensions of the encoded dataset: (8672, 9)
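Count and frequency encoding keep the number of columns unchanged because each label is simply replaced by a number learned from the training data. A minimal pandas-only sketch of that mapping follows; the toy series and variable names are invented for illustration, and this is not the `feature_engine` implementation.

```python
import pandas as pd

# Toy training and test columns (values are invented for illustration)
train = pd.Series(["FAMILY", "GAME", "FAMILY", "Rare", "FAMILY", "GAME"], name="Category")
test = pd.Series(["GAME", "Rare", "WEATHER"], name="Category")

# Mappings learned on the training data only
counts = train.value_counts().to_dict()                      # label -> number of rows
frequencies = train.value_counts(normalize=True).to_dict()   # label -> share of rows

train_count = train.map(counts)
test_count = test.map(counts)    # "WEATHER" is not in the mapping and becomes NaN

print(train_count.tolist())
print(test_count.tolist())
```

A label missing from the mapping comes back as NaN, which also turns the column into floats. Two labels with the same count receive the same code, so this encoding can merge categories that a one-hot encoding would keep apart.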


Now let's save the data.

In [141]:
# X_train_enc.to_csv("Xtrain_booking_count_cat_hot.csv",index=False)
# X_test_enc.to_csv("Xtest_booking_count_cat_hot.csv",index=False)