# <p style="text-align: center;"> Final Project - Introduction to Data Science</p>
# <p style="text-align: center;"> <b> Data Collecting </b></p>
---

## Member Information
| Name              | ID       |
|-------------------|----------|
| Tran Đinh Quang   | 21127406 |
| Nguyen Hong Hanh  | 21127503 |
| Do Quoc Tri       | 21127556 |
| Nguyen Khanh Nhan | 21127657 |


---

## Import libraries

In [2]:
import requests
import sys
import json
import pandas as pd
from datetime import datetime, timedelta, date
import os

## Data introduction

The data is fetched from the [Visual Crossing](https://www.visualcrossing.com/) website. It covers the weather in Ho Chi Minh City from 27/09/2009 to 05/06/2023. Because the free API key is limited to 1000 queries of data per day, each run retrieves 1000 days of data and writes the date to resume from to a file, so that subsequent runs can continue where the previous one stopped.
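
The run-by-run scheme described above can be sketched as follows. This is a minimal illustration, assuming inclusive date windows of 1000 days each; the function name `plan_runs` and its parameters are hypothetical and not part of the notebook.

```python
from datetime import date, timedelta

def plan_runs(first_day, last_day, days_per_run=1000):
    """Split an inclusive date range into consecutive windows of at most days_per_run days."""
    windows = []
    start = first_day
    while start <= last_day:
        # A window of N days ends N - 1 days after its start (both ends inclusive)
        end = min(start + timedelta(days=days_per_run - 1), last_day)
        windows.append((start, end))
        start = end + timedelta(days=1)
    return windows

# Date range stated in the text: 27/09/2009 to 05/06/2023
windows = plan_runs(date(2009, 9, 27), date(2023, 6, 5))
for start, end in windows:
    print(start, end, (end - start).days + 1)
```

For this date range the sketch yields five windows of 1000 days each, so the whole data set can be collected in five daily runs.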

## Get the data

You can sign up for a free account at Visual Crossing, which comes with an API key.

In [2]:
key = 'PEVKTTBLZUTL89Z6WHQ5V2HRK'
num_of_days = 999

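# Read the start date saved by the previous run (format: YYYY-MM-DD)
with open('../data/lastdate.txt', 'r') as file:
    start = file.read().strip()

# The range start..end is inclusive, so 999 extra days gives 1000 days in total
end = datetime.strptime(start, '%Y-%m-%d') + timedelta(days=num_of_days)
end = end.strftime('%Y-%m-%d')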
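
As an alternative to writing the key directly in the notebook, it could be read from an environment variable. The sketch below is only an illustration: the helper `load_api_key` and the variable name `VISUAL_CROSSING_KEY` are assumptions, not something defined by Visual Crossing or by this project.

```python
import os

def load_api_key(env_var='VISUAL_CROSSING_KEY'):
    """Return the API key stored in an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, '').strip()
    if not key:
        raise RuntimeError(f'Set the {env_var} environment variable to your API key')
    return key
```

With this helper, the assignment in the cell above would become `key = load_api_key()`.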

**Caution** The next cell fetches the data from the website. Because we can only get 1000 queries per day, **ONLY RUN IT ONCE**.

In [3]:
# Request the daily records for Ho Chi Minh City between start and end (inclusive)
url = (
    'https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline/'
    f'Ho%20Chi%20Minh/{start}/{end}?unitGroup=us&include=days&key={key}&contentType=json'
)
response = requests.get(url)
if response.status_code != 200:
    print('Unexpected Status code: ', response.status_code)
    sys.exit()

jsonData = response.json()
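
The request URL has four variable parts: the location, the start date, the end date, and the key. The sketch below shows one way to assemble it from those parts with the standard library, using the same endpoint and query parameters as the cell above. The helper name `build_timeline_url` and the placeholder key are hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = 'https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline'

def build_timeline_url(location, start, end, key):
    """Build the timeline request URL from its parts."""
    query = urlencode({
        'unitGroup': 'us',
        'include': 'days',
        'key': key,
        'contentType': 'json',
    })
    # quote() percent-encodes the spaces in the location name
    return f'{BASE_URL}/{quote(location)}/{start}/{end}?{query}'

print(build_timeline_url('Ho Chi Minh', '2009-09-27', '2009-09-30', 'YOUR_KEY'))
```

The resulting string can be passed to `requests.get` in place of the hand-written URL.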

## Check the location information

In [4]:
# Keep only the fields that describe the resolved location
features = ['latitude', 'longitude', 'resolvedAddress', 'address', 'timezone', 'tzoffset']
location_info = {f: jsonData[f] for f in features}
location_info

{'latitude': 10.7764,
 'longitude': 106.701,
 'resolvedAddress': 'Quận 1, Hồ Chí Minh, Việt Nam',
 'address': 'Ho Chi Minh',
 'timezone': 'Asia/Ho_Chi_Minh',
 'tzoffset': 7.0}
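
The same check can also be done programmatically instead of by eye. The following is a minimal sketch, assuming that the timezone and UTC offset shown in the output above are the values to expect; the function `check_location` is hypothetical.

```python
def check_location(info, expected_timezone='Asia/Ho_Chi_Minh', expected_offset=7.0):
    """Return a list of problems found in the location metadata (empty if none)."""
    problems = []
    if info.get('timezone') != expected_timezone:
        problems.append(f"unexpected timezone: {info.get('timezone')}")
    if info.get('tzoffset') != expected_offset:
        problems.append(f"unexpected tzoffset: {info.get('tzoffset')}")
    return problems

# Values copied from the output above
sample = {
    'latitude': 10.7764,
    'longitude': 106.701,
    'resolvedAddress': 'Quận 1, Hồ Chí Minh, Việt Nam',
    'address': 'Ho Chi Minh',
    'timezone': 'Asia/Ho_Chi_Minh',
    'tzoffset': 7.0,
}
print(check_location(sample))  # prints []
```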

## Get some data samples

In [5]:
df = pd.DataFrame.from_dict(jsonData['days'])
df.sample()

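|     | datetime | datetimeEpoch | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | ... | sunriseEpoch | sunset | sunsetEpoch | moonphase | conditions | description | icon | stations | source | severerisk |
|-----|----------|---------------|---------|---------|------|--------------|--------------|-----------|-----|----------|-----|--------------|--------|-------------|-----------|------------|-------------|------|----------|--------|------------|
| 870 | 2023-01-27 | 1674752400 | 91.4 | 75.1 | 81.4 | 98.1 | 75.1 | 84.7 | 72.7 | 76.4 | ... | 1674775018 | 17:54:53 | 1674816893 | 0.18 | Rain, Partially cloudy | Partly cloudy throughout the day with late aft... | rain | [48894099999, 48900099999, VVTS] | obs | 10.0 |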
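
Because the request uses `unitGroup=us`, the temperature columns are in degrees Fahrenheit, and the `...Epoch` columns are Unix timestamps. The sketch below shows how the sample values can be read, assuming the fixed UTC+7 offset reported in the location information. The helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# UTC+7, taken from the tzoffset value in the location information
LOCAL_TZ = timezone(timedelta(hours=7))

def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

def epoch_to_local(epoch_seconds):
    """Convert a Unix timestamp to a timezone-aware local datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=LOCAL_TZ)

print(round(fahrenheit_to_celsius(91.4), 1))   # tempmax of the sample row, prints 33.0
print(epoch_to_local(1674752400).isoformat())  # prints 2023-01-27T00:00:00+07:00
```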


## Save the data to file

In [6]:
df.to_csv(f'../data/raw_{start}_{end}.csv', index=False)

**Caution** The following cell saves the start date for the next run. **Do not run it until all the previous cells have finished.**

In [7]:
# The next run starts on the day after the last day fetched in this run
end = datetime.strptime(end, '%Y-%m-%d') + timedelta(days=1)
end = end.strftime('%Y-%m-%d')
with open('../data/lastdate.txt', 'w') as file:
    file.write(end)
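
One way to make this step harder to run too early is to advance the saved date only when the CSV file for the current window already exists. The following is a minimal sketch of that idea; the function `advance_last_date` is hypothetical and is not part of the notebook.

```python
import os
from datetime import datetime, timedelta

def advance_last_date(csv_path, end, state_path):
    """Write end + 1 day to state_path, but only if csv_path exists.

    Returns True if the date file was updated, False otherwise.
    """
    if not os.path.exists(csv_path):
        # The data for this window was not saved, so keep the old start date
        return False
    next_start = datetime.strptime(end, '%Y-%m-%d') + timedelta(days=1)
    with open(state_path, 'w') as file:
        file.write(next_start.strftime('%Y-%m-%d'))
    return True
```

In this notebook it would be called as `advance_last_date(f'../data/raw_{start}_{end}.csv', end, '../data/lastdate.txt')`, before `end` is reassigned.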

## Save all data to raw.csv

[How to Combine Multiple CSV Files into One File Using Python: A Step-by-Step Guide (2023)](https://raredogmarketing.com/resources/combining-multiple-csv-files-into-one-file-using-python-step-by-step-guide-2023)

In [10]:
data_path = '../data/'
# Collect the per-run CSV files, skipping the combined output file itself.
# Sorting by name keeps the date ranges in chronological order.
csv_files = sorted(f for f in os.listdir(data_path) if f.endswith('.csv') and f != 'raw.csv')

df_list = []
for csv_name in csv_files:
    file_path = os.path.join(data_path, csv_name)
    try:
        try:
            # Try reading the file using default UTF-8 encoding
            df = pd.read_csv(file_path)
        except UnicodeDecodeError:
            # If UTF-8 fails, try reading the file using UTF-16 encoding with tab separator
            df = pd.read_csv(file_path, sep='\t', encoding='utf-16')
        df_list.append(df)
    except Exception as e:
        print(f"Could not read file {csv_name} because of error: {e}")

raw_df = pd.concat(df_list, ignore_index=True)

# Save the final result to a new CSV file
raw_df.to_csv(os.path.join(data_path, 'raw.csv'), index=False)
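
After combining the files, it can be useful to confirm that no day is missing between the first and last date. The sketch below uses only the standard library and assumes the dates are ISO strings such as those in the `datetime` column; the function `find_missing_dates` is hypothetical.

```python
from datetime import date, timedelta

def find_missing_dates(date_strings):
    """Return the ISO dates missing between the earliest and latest date given."""
    days = sorted(date.fromisoformat(s) for s in set(date_strings))
    if not days:
        return []
    present = set(days)
    missing = []
    current = days[0]
    while current <= days[-1]:
        if current not in present:
            missing.append(current.isoformat())
        current += timedelta(days=1)
    return missing

print(find_missing_dates(['2023-01-01', '2023-01-02', '2023-01-04']))  # prints ['2023-01-03']
```

Applied to the combined data, the call would be `find_missing_dates(raw_df['datetime'])`, and an empty list would mean the date range has no gaps.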