# KDD Assignment 2
![CS306](https://img.shields.io/badge/CS306-Data%20Mining-orange) &nbsp;
![2022s](https://img.shields.io/badge/semester-2022%20spring-blue)

Author: 何泽安 (He Zean) &nbsp;&nbsp; SID: 12011323

## Part 1. Data Collection

We first inspect the network traffic of the web page to find the API URL that serves its data. Only the `chinaDayList` and `chinaDayAddList` modules are needed, so we restrict the `modules` query parameter to those two, as shown in the code below.

<img src="https://i.imgur.com/JcNXIGi.png" alt="api capture" style="zoom:25%;" />

In [None]:
from urllib.request import urlopen
import json

api = 'https://api.inews.qq.com/newsqa/v1/query/inner/publish/modules/list?modules=chinaDayList,chinaDayAddList'
raw_data = urlopen(api).read().decode('utf-8')
raw_data = json.loads(raw_data)['data']

raw_data
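
The URL construction and the `['data']` lookup described above can be sketched as two small helpers. This is only an illustration: `build_api_url` and `extract_modules` are hypothetical names, and `sample_body` is a hand-written stand-in for a response, not real API output.

```python
import json
from urllib.parse import urlencode

BASE_URL = 'https://api.inews.qq.com/newsqa/v1/query/inner/publish/modules/list'


def build_api_url(modules):
    """Build the request URL for a list of module names (hypothetical helper)."""
    # safe=',' keeps the comma-separated list unescaped, matching the URL used above
    query = urlencode({'modules': ','.join(modules)}, safe=',')
    return BASE_URL + '?' + query


def extract_modules(body, modules):
    """Parse a JSON response body and return only the requested modules."""
    data = json.loads(body)['data']
    missing = [m for m in modules if m not in data]
    if missing:
        raise KeyError(f'missing modules in response: {missing}')
    return {m: data[m] for m in modules}


wanted = ['chinaDayList', 'chinaDayAddList']
print(build_api_url(wanted))

# Hand-written sample body (an assumption for illustration only)
sample_body = json.dumps({'data': {'chinaDayList': [], 'chinaDayAddList': [], 'other': 1}})
print(sorted(extract_modules(sample_body, wanted)))
```

Failing early on a missing module makes a changed or partial response visible at collection time rather than during cleaning.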

## Part 2. Data Cleaning

We then compare the API response with the figures displayed on the web page to determine which page label each field corresponds to.

```json
{
    "chinaDayAddList": [
        {
            "y": "2022",
            "confirm": 5451,                 // newly confirmed (新增确诊)
            "suspect": 0,                    // newly suspected (新增疑似)
            "date": "04.15"
        }
    ],
    "chinaDayList": [
        {
            "y": "2022",
            "nowConfirm": 259560,            // currently confirmed (现有确诊)
            "dead": 14561,                   // cumulative deaths (累计死亡)
            "heal": 227416,                  // cumulative recovered (累计治愈)
            "confirm": 519822,               // cumulative confirmed (累计确诊)
            "date": "04.15"
        }
    ]
}
```

In [None]:
import pandas as pd

day_add = pd.DataFrame.from_records(raw_data['chinaDayAddList'])
day_add['date'] = day_add['y'] + '-' + day_add['date'].str.replace('\\.', '-', regex=True)
day_add['date'] = pd.to_datetime(day_add['date'])

day_add.drop(day_add.columns.difference(['date', 'confirm', 'suspect']), axis=1, inplace=True)  # keep only info we need

day_add.sort_values(by='date', inplace=True)
day_add = day_add.tail(30).copy()  # keep only the last 30 days; copy so the in-place edits below do not act on a slice
day_add.fillna(day_add.mean(numeric_only=True), inplace=True)  # fill missing values with the column mean
day_add['day_bias'] = (day_add['date'] - day_add['date'].min()) / pd.Timedelta('1 days') + 1  # 1-based day index within the 30-day window

day_add.tail(5)  # preview

In [None]:
day_info = pd.DataFrame.from_records(raw_data['chinaDayList'])
day_info['date'] = day_info['y'] + '-' + day_info['date'].str.replace('\\.', '-', regex=True)
day_info['date'] = pd.to_datetime(day_info['date'])

day_info.drop(day_info.columns.difference(['date', 'dead', 'heal', 'confirm', 'nowConfirm']), axis=1, inplace=True)  # keep only info we need
day_info.sort_values(by='date', inplace=True)
day_info = day_info.tail(30).copy()  # keep only the last 30 days; copy so the in-place edits below do not act on a slice

day_info.fillna(day_info.mean(numeric_only=True), inplace=True)  # fill missing values with the column mean
day_info['day_bias'] = (day_info['date'] - day_info['date'].min()) / pd.Timedelta('1 days') + 1  # 1-based day index within the 30-day window

day_info.tail(5)  # preview
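
The two cleaning cells above repeat the same steps, so they could be folded into one helper. The sketch below is one possible way to do that, not the code used for the results in this report: `clean_daily` is a hypothetical name, the sample records are made up, and `day_bias` comes out as an integer here rather than the float produced above.

```python
import pandas as pd


def clean_daily(records, value_cols, n_days=30):
    """Build a date-sorted frame of the last n_days with a 1-based day index."""
    df = pd.DataFrame.from_records(records)
    # Combine the year ('2022') and the day ('04.15') into one parseable date string
    df['date'] = pd.to_datetime(df['y'] + '-' + df['date'].str.replace('.', '-', regex=False))
    df = df[['date', *value_cols]].sort_values('date').tail(n_days).copy()
    # Fill missing values with the mean of each column over the kept window
    df[value_cols] = df[value_cols].fillna(df[value_cols].mean())
    df['day_bias'] = (df['date'] - df['date'].min()).dt.days + 1
    return df.reset_index(drop=True)


# Made-up records, deliberately out of order and with one missing value
sample = [
    {'y': '2022', 'date': '04.15', 'confirm': 300.0},
    {'y': '2022', 'date': '04.13', 'confirm': 100.0},
    {'y': '2022', 'date': '04.14', 'confirm': float('nan')},
]
cleaned = clean_daily(sample, ['confirm'])
print(cleaned)
```

With such a helper, the two frames above would be obtained as `clean_daily(raw_data['chinaDayAddList'], ['confirm', 'suspect'])` and `clean_daily(raw_data['chinaDayList'], ['dead', 'heal', 'confirm', 'nowConfirm'])`.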

## Part 3. Linear Regression Models

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np


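def modeling(name, x, y):
    """Fit y = a * x + b on the first 80% of days, report RMSE on the last 20%, and predict the next day."""
    print(name)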
    # training
    model = LinearRegression()
    X_train, X_test, y_train, y_test = train_test_split(
        x.values.reshape(-1, 1), y, test_size=0.2, shuffle=False)  # time series: keep chronological order, test on the most recent days
    model.fit(X_train, y_train)
    print(f'Y = {model.coef_[0]} * X + {model.intercept_}')

    # validating
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f'RMSE = {rmse}')

    # predicting
    pred_day_bias = [[x.max() + 1]]  # 2D array for prediction
    print(f'Pred = {model.predict(pred_day_bias)[0]:.3f}')
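
As a sanity check on the procedure in `modeling`, the same chronological split, fit, RMSE, and next-day extrapolation can be reproduced with NumPy alone. This sketch swaps `LinearRegression` for `np.polyfit` (an ordinary least-squares line fit), uses the hypothetical name `fit_and_forecast`, and runs on synthetic, exactly linear data, so the recovered slope, intercept, and forecast are known in advance.

```python
import numpy as np


def fit_and_forecast(day_bias, values, test_fraction=0.2):
    """Fit a line on the earliest days, score it on the latest, forecast the next day."""
    x = np.asarray(day_bias, dtype=float)
    y = np.asarray(values, dtype=float)
    # Chronological split: the last test_fraction of the samples is held out
    n_test = int(np.ceil(len(x) * test_fraction))
    split = len(x) - n_test
    slope, intercept = np.polyfit(x[:split], y[:split], deg=1)
    residuals = y[split:] - (slope * x[split:] + intercept)
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    forecast = float(slope * (x.max() + 1) + intercept)
    return float(slope), float(intercept), rmse, forecast


# Synthetic series y = 2x + 1 over 30 days (not real case data)
days = np.arange(1, 31)
slope, intercept, rmse, forecast = fit_and_forecast(days, 2.0 * days + 1.0)
print(f'Y = {slope:.3f} * X + {intercept:.3f}, RMSE = {rmse:.3g}, Pred = {forecast:.3f}')
```

On real data the held-out RMSE is in the units of the target, so values for cumulative counts and daily counts are not directly comparable.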

In [None]:
modeling('Now Confirm (现有确诊)', day_info['day_bias'], day_info['nowConfirm'])

In [None]:
modeling('New Confirm (新增确诊)', day_add['day_bias'], day_add['confirm'])

In [None]:
modeling('New Suspect (新增疑似)', day_add['day_bias'], day_add['suspect'])

In [None]:
modeling('Accumulated Confirm (累计确诊)', day_info['day_bias'], day_info['confirm'])

In [None]:
modeling('Accumulated Heal (累计治愈)', day_info['day_bias'], day_info['heal'])

In [None]:
modeling('Accumulated Dead (累计死亡)', day_info['day_bias'], day_info['dead'])