<img src="../assets/headline.png" alt="headline">

## Preprocessing, Exercise 2.3

## SUIT: Univariate Feature Selection

Lecture 2, November 9th, 2022

# 1.0 Setup
If you haven't done so already, go through section 1.0 of the l1_warmup.ipynb notebook first.

## 1.1 Let's import some packages

In [2]:
import pandas as pd
from utilities.self_tests import test_your_notebook

## 1.2 Exploratory Data Analysis


### Can we skip the PHIDS?
We can skip the PHID if we saw this data a few minutes ago, but we can never skip `set_index`: `resample` needs a datetime-like index to work on.

In [3]:
# Load the events, parse the timestamps, and set them as the index
events = pd.read_csv('../data/tiktok_events.csv')
events['datetime'] = pd.to_datetime(events['datetime'])
events.set_index('datetime',inplace=True)
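
The three loading steps above can also be folded into the `read_csv` call itself. The following is a minimal sketch, assuming a small in-memory sample (hypothetical rows in the same column layout as `tiktok_events.csv`) rather than the real file:

```python
import io
import pandas as pd

# Hypothetical sample rows; the real notebook reads ../data/tiktok_events.csv
csv_text = """datetime,views,likes,city,age_group
2022-01-01 00:00:00,1,0,rishon,18-21
2022-01-01 00:00:27,1,0,tel-aviv,21-25
2022-01-02 08:15:00,1,1,tel-aviv,13-15
"""

# parse_dates converts the column to datetimes, index_col makes it the index
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["datetime"],
    index_col="datetime",
)

print(type(sample.index).__name__)
```

Either way, the point is to end up with a `DatetimeIndex`, since that is what `resample` operates on.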

In [4]:
# Let's just make sure that this is the right data...
events.head()

datetime,views,likes,city,age_group
2022-01-01 00:00:00,1,0,rishon,18-21
2022-01-01 00:00:00,1,0,kadima,13-15
2022-01-01 00:00:00,1,0,ramat-gan,18-21
2022-01-01 00:00:00,1,0,tel-aviv,Prefer not to Say
2022-01-01 00:00:27,1,0,tel-aviv,21-25


# 2.0 Resample, Resample, Resample!

**{Q} 3 A. What is the maximum number of views per event on the first day?**

In [5]:
# Hint: Resample gives us some free stats already!
# Check out https://pandas.pydata.org/pandas-docs/stable/reference/resampling.html

events.resample('1D').views.max()

datetime
2022-01-01    4
2022-01-02    4
2022-01-03    4
2022-01-04    4
2022-01-05    4
             ..
2022-11-03    4
2022-11-04    4
2022-11-05    4
2022-11-06    4
2022-11-07    4
Freq: D, Name: views, Length: 311, dtype: int64

In [6]:
A = 4
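
To see what `resample('1D')` does to the rows before the reducer runs, here is a small sketch on a hypothetical four-row frame (the values are made up for illustration, not taken from the dataset). Rows are bucketed by calendar day and `max` is applied per bucket:

```python
import pandas as pd

# Hypothetical toy data: two events on each of two days
toy = pd.DataFrame(
    {"views": [1, 4, 2, 3]},
    index=pd.to_datetime(
        [
            "2022-01-01 00:00",
            "2022-01-01 12:00",
            "2022-01-02 09:00",
            "2022-01-02 23:59",
        ]
    ),
)

# One bucket per calendar day, then the maximum of 'views' inside each bucket
daily_max = toy.resample("1D")["views"].max()

# Pick out a single day by its timestamp label
first_day_max = daily_max.loc[pd.Timestamp("2022-01-01")]
print(first_day_max)
```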

**{Q} 3 B. How many views from Tel Aviv did we get on 2022-01-03?**

In [8]:
events.resample('1D').aggregate(
    tel_aviv = ("city", lambda x: sum(x=="tel-aviv"))
)

datetime,tel_aviv
2022-01-01,311
2022-01-02,340
2022-01-03,500
2022-01-04,322
2022-01-05,425
...,...
2022-11-03,1066
2022-11-04,1205
2022-11-05,1381
2022-11-06,1350


In [10]:
B = 500
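
The named aggregation above counts rows by summing a boolean mask. A sketch of the same idea on hypothetical toy data, next to an alternative that filters first and then counts rows per day, might look like this. Note that the filter-first variant only spans the days between the first and last Tel-Aviv row, so the two results can cover different date ranges on real data:

```python
import pandas as pd

# Hypothetical toy data: city per event
toy = pd.DataFrame(
    {"city": ["tel-aviv", "rishon", "tel-aviv", "tel-aviv", "kadima"]},
    index=pd.to_datetime(
        [
            "2022-01-01 01:00",
            "2022-01-01 02:00",
            "2022-01-02 03:00",
            "2022-01-02 04:00",
            "2022-01-02 05:00",
        ]
    ),
)

# Named aggregation: output_name=(column, reducer); True counts as 1
by_mask = toy.resample("1D").aggregate(
    tel_aviv=("city", lambda x: (x == "tel-aviv").sum())
)["tel_aviv"]

# Alternative: keep only Tel-Aviv rows, then count rows per day
by_filter = toy[toy["city"] == "tel-aviv"].resample("1D").size()

print(by_mask.tolist(), by_filter.tolist())
```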

**{Q} 3 C. What share of the views came from Tel-Aviv on 2022-01-04? State it as an integer percentage.**
Note: assume a single view per row for now.

In [11]:
events.resample('1D').aggregate(
    tel_aviv = ("city", lambda x: sum(x=="tel-aviv") / len(x)),
)

datetime,tel_aviv
2022-01-01,0.789340
2022-01-02,0.656371
2022-01-03,0.696379
2022-01-04,0.602996
2022-01-05,0.704809
...,...
2022-11-03,0.369114
2022-11-04,0.404906
2022-11-05,0.414715
2022-11-06,0.402504


In [16]:
C = 60
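
The share computed above counts rows, which matches the views share only under the single-view-per-row assumption from the note. As a hedged sketch of what changes when that assumption is dropped, the hypothetical toy frame below (made-up values) compares the row-based share with a share weighted by the `views` column:

```python
import pandas as pd

# Hypothetical toy data: one row carries 3 views, the rest carry 1
toy = pd.DataFrame(
    {
        "views": [1, 3, 1, 1],
        "city": ["tel-aviv", "rishon", "tel-aviv", "kadima"],
    },
    index=pd.to_datetime(
        [
            "2022-01-04 01:00",
            "2022-01-04 02:00",
            "2022-01-04 03:00",
            "2022-01-04 04:00",
        ]
    ),
)

is_tlv = toy["city"] == "tel-aviv"

# Row-based share: the mean of a 0/1 column is the fraction of matching rows
row_share = is_tlv.astype(int).resample("1D").mean()

# View-weighted share: Tel-Aviv views divided by all views, per day
tlv_views = toy["views"].where(is_tlv, 0).resample("1D").sum()
total_views = toy["views"].resample("1D").sum()
view_share = tlv_views / total_views

# Integer percentages (truncated, not rounded)
row_pct = int(row_share.iloc[0] * 100)
view_pct = int(view_share.iloc[0] * 100)
print(row_pct, view_pct)
```

Here two of four rows are from Tel-Aviv, but they account for only two of six views, so the two definitions disagree as soon as a row can hold more than one view.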

**{Q} 2 D. Combine two reducers: daily like ratio and daily Tel-Aviv views share. Which is larger on 2022-01-01? State 1 for Total, 2 for Tel-Aviv.**

In [17]:
events.resample('1D').aggregate(
    like_ratio = ("likes", lambda x: sum(x)/len(x)),
    tel_aviv = ("city", lambda x: sum(x=="tel-aviv") / len(x))
)

datetime,like_ratio,tel_aviv
2022-01-01,0.164975,0.789340
2022-01-02,0.160232,0.656371
2022-01-03,0.165738,0.696379
2022-01-04,0.127341,0.602996
2022-01-05,0.157546,0.704809
...,...,...
2022-11-03,0.153393,0.369114
2022-11-04,0.125000,0.404906
2022-11-05,0.157658,0.414715
2022-11-06,0.144007,0.402504


In [18]:
D = 2

**{Q} 2 E. How many likes did we get from Tel-Aviv on 2022-01-05?**

In [22]:
events.resample('1D').apply(lambda df: df[df['city']=='tel-aviv'].likes.sum())

datetime
2022-01-01     48
2022-01-02     55
2022-01-03     94
2022-01-04     44
2022-01-05     69
             ... 
2022-11-03    191
2022-11-04    170
2022-11-05    240
2022-11-06    238
2022-11-07    139
Freq: D, Length: 311, dtype: int64

In [23]:
E = 69

**{Q} 2 F. Repeat E with an external named function instead of an inline lambda. How many likes did we get from Tel-Aviv on 2022-01-05?**

In [24]:
def get_tlv_likes(df: pd.DataFrame) -> int:
    tel_aviv = df[df['city'] == 'tel-aviv']
    likes = tel_aviv['likes'].sum()
    return likes

events.resample('1D').apply(get_tlv_likes)

datetime
2022-01-01     48
2022-01-02     55
2022-01-03     94
2022-01-04     44
2022-01-05     69
             ... 
2022-11-03    191
2022-11-04    170
2022-11-05    240
2022-11-06    238
2022-11-07    139
Freq: D, Length: 311, dtype: int64

In [25]:
F = 69
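
Questions E and F hand each daily chunk of the frame to a function. A minimal sketch on hypothetical toy data makes that explicit by looping over the resampler's groups by hand, and compares the result with a filter-first alternative that avoids a per-group Python function altogether. The toy values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical toy data: every event got one like
toy = pd.DataFrame(
    {
        "likes": [1, 1, 1, 1, 1],
        "city": ["tel-aviv", "rishon", "tel-aviv", "tel-aviv", "kadima"],
    },
    index=pd.to_datetime(
        [
            "2022-01-05 01:00",
            "2022-01-05 02:00",
            "2022-01-06 03:00",
            "2022-01-06 04:00",
            "2022-01-06 05:00",
        ]
    ),
)


def get_tlv_likes(df: pd.DataFrame) -> int:
    """Sum the likes of the Tel-Aviv rows in one daily chunk."""
    return int(df.loc[df["city"] == "tel-aviv", "likes"].sum())


# Iterating a resampler yields (bucket label, sub-frame) pairs
by_loop = pd.Series(
    {day: get_tlv_likes(chunk) for day, chunk in toy.resample("1D")}
)

# Alternative: filter the rows once, then use the built-in sum reducer
by_filter = toy.loc[toy["city"] == "tel-aviv", "likes"].resample("1D").sum()

print(by_loop.tolist(), by_filter.tolist())
```

The filter-first form usually runs faster on large frames, but it only covers the days between the first and last Tel-Aviv row.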

# 3.0 Datetime Features

We know from last week that seasonal features may come in handy!

<img src="../assets/resampler_index.png" alt="resampler_index">


**{Q} 2 G. Add day-of-week, week, and month features to Q4(D). What is the day-of-week of 2022-01-01? (State an integer.)**

In [27]:
d_resampler = events.resample('1D').aggregate(
    likes_ratio = ("likes", lambda x: sum(x) / len(x)),
    tel_aviv = ("city", lambda x: sum(x=="tel-aviv") / len(x))
)

isocal = d_resampler.index.isocalendar()

d_resampler['dow'] = isocal.day
d_resampler['week'] = isocal.week
d_resampler['month'] = d_resampler.index.month

d_resampler

datetime,likes_ratio,tel_aviv,dow,week,month
2022-01-01,0.164975,0.789340,6,52,1
2022-01-02,0.160232,0.656371,7,52,1
2022-01-03,0.165738,0.696379,1,1,1
2022-01-04,0.127341,0.602996,2,1,1
2022-01-05,0.157546,0.704809,3,1,1
...,...,...,...,...,...
2022-11-03,0.153393,0.369114,4,44,11
2022-11-04,0.125000,0.404906,5,44,11
2022-11-05,0.157658,0.414715,6,44,11
2022-11-06,0.144007,0.402504,7,44,11


In [28]:
G = 6
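
The table above shows 2022-01-01 with day 6 and week 52, which can look odd for the first day of January. The short sketch below checks the conventions involved: `isocalendar()` follows ISO 8601 (Monday is 1, Sunday is 7, and early-January days can belong to the last ISO week of the previous year), while `dayofweek` counts from Monday as 0.

```python
import pandas as pd

days = pd.date_range("2022-01-01", periods=3, freq="D")

# isocalendar() returns a frame with 'year', 'week' and 'day' columns
iso = days.isocalendar()

iso_day = int(iso["day"].iloc[0])    # ISO: Monday=1 ... Sunday=7
iso_week = int(iso["week"].iloc[0])  # ISO week number
iso_year = int(iso["year"].iloc[0])  # ISO year, may differ from calendar year
pandas_dow = int(days.dayofweek[0])  # pandas: Monday=0 ... Sunday=6
day_name = days[0].day_name()

print(iso_day, iso_week, iso_year, pandas_dow, day_name)
```

So the 6 in the answer is a Saturday in ISO numbering, and the same date would be 5 under `dayofweek`.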

In [29]:
# cleanup
del events

In [30]:
# Run this code block to get your grade!
NOTEBOOK_CODE = "E3"
test_your_notebook(NOTEBOOK_CODE, A, B, C, D, E, F, G)

You Rockstar! That's an A! Your Final Grade: 100
-------------
Grade analysis:
4 - True
500 - True
60 - True
2 - True
69 - True
69 - True
6 - True
