# Introduction




This notebook covers the data processing stage of our project. It bridges the acquisition of raw data and the later analysis and modeling: here we inspect the raw data and reshape it into a clean, consistent format that is suitable for analysis.

The main objectives of this notebook include:

1. **Understanding the Dataset:** Gaining a clear view of the structure, content, and peculiarities of the data we have collected.

2. **Data Preparation:** Implementing cleaning and transformation steps to convert raw data into a more usable and meaningful format for future analysis.

3. **Establishing a Solid Foundation for Analysis:** Ensuring that the data is ready and accessible for performing statistical analyses, visualizations, and data modeling in the next steps of our project.

By the end of this process, we will have a clean, organized, and well-documented dataset, ready for in-depth exploration and analysis.



## Import libraries


In this section of the notebook, we import three key libraries that are fundamental to our data cleaning and preparation process:

1. **Importing Project's Paths Module (`paths`):**
   - `import final_project.utils.paths as path`:
     - Here, we import the `paths` module from the `final_project.utils` package. This module is used for managing file and directory paths within the project, which facilitates organized and coherent path management. Importing it as `path` allows us to access these predefined paths more simply and directly.

2. **Importing the `janitor` Library:**
   - `import janitor`:
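     - `janitor` (distributed as `pyjanitor`) is a library that provides data cleaning functions for Pandas, making common data cleaning tasks easier and improving code readability. These functions include, among others, cleaning column names, removing empty rows and columns, and filling missing values. Importing it registers these functions as additional DataFrame methods, so the import is required even though the name `janitor` is never referenced directly in the code.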

3. **Importing Pandas:**
   - `import pandas as pd`:
     - Pandas is an essential library in data science for manipulating and analyzing data in Python. Its main data structure, the DataFrame, allows for easy manipulation of tabular data with numerous operations for filtering, sorting, and summarizing.

These libraries form the foundation of our data processing environment, enabling us to efficiently handle data from loading to cleaning and preparing it for subsequent analysis.


In [1]:
import final_project.utils.paths as path
import janitor
import pandas as pd



## Read data



- `input_covid_file = path.data_raw_dir("time_series_covid19_confirmed_global.csv")`:
  - We are defining the variable `input_covid_file`, which will be used to store the path to the raw COVID-19 data file.
  - The function `data_raw_dir` from the `path` module (previously imported) is used here. This function, a part of our project’s path management structure, builds paths inside the directory where raw data is stored. Given a file name, it returns the complete path to that file.
  - We pass the file name `"time_series_covid19_confirmed_global.csv"` as an argument to the function, indicating that we are interested in the path to this specific file.
  - This practice ensures a coherent and centralized management of file paths in the project, improving reproducibility and reducing errors caused by hard-coded file paths.

The use of this `input_covid_file` variable in later stages of the notebook will allow us to load and manipulate the COVID-19 data easily and accurately.


In [2]:
input_covid_file = path.data_raw_dir("time_series_covid19_confirmed_global.csv")
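
The `final_project.utils.paths` module itself is not shown in this notebook. As a rough illustration of the idea, a helper of this kind could be built on `pathlib`. Everything in the sketch below is an assumption made for illustration: the `PROJECT_DIR` value, the `data/raw` and `data/processed` layout, and the function bodies are hypothetical and may differ from the project's actual module.

```python
from pathlib import Path

# Hypothetical project root; the real module presumably locates it differently.
PROJECT_DIR = Path("final_project_root")


def data_raw_dir(*parts: str) -> Path:
    """Return a path inside the (assumed) raw data directory."""
    return PROJECT_DIR.joinpath("data", "raw", *parts)


def data_processed_dir(*parts: str) -> Path:
    """Return a path inside the (assumed) processed data directory."""
    return PROJECT_DIR.joinpath("data", "processed", *parts)


example_path = data_raw_dir("time_series_covid19_confirmed_global.csv")
print(example_path)
```

Centralizing the directory layout in one place like this means a change of folder structure only has to be made once, rather than in every notebook.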


### Loading and Initial Review of the COVID-19 Data

- `covid_df = pd.read_csv(input_covid_file)`:
  - In this line, we use the `read_csv` function from Pandas to load the COVID-19 data into a DataFrame named `covid_df`.
  - The variable `input_covid_file`, which contains the path to the raw data file, is used as an argument, ensuring that we are loading the correct file.
  - This is a crucial step in data processing, where we transform the raw data stored in a CSV file into a DataFrame structure, which is more versatile and convenient for analysis in Python.

- `covid_df.info()`:
  - This method prints a summary of the `covid_df` DataFrame. For narrow DataFrames it lists every column with its non-null count and data type. For a DataFrame as wide as this one, Pandas prints a condensed summary instead: the number of entries, the number of columns with the first and last column names, a count of columns per data type, and the memory usage.
  - It is a standard practice in data analysis to get a quick overview of the structure of newly loaded data.
  - This information helps us plan the next steps in data processing, such as reshaping the data, converting data types, or checking specific columns for missing values.

These two lines of code represent the start of our data analysis, providing a solid foundation for the data cleaning and exploration tasks that will follow.


In [3]:
covid_df = pd.read_csv(input_covid_file)
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289 entries, 0 to 288
Columns: 1147 entries, Province/State to 3/9/23
dtypes: float64(2), int64(1143), object(2)
memory usage: 2.5+ MB



### Result of `covid_df.info()`

The output from executing `covid_df.info()` on our DataFrame `covid_df` provides valuable information about the structure and content of the COVID-19 data:

- **Class Type:** The DataFrame is of type `<class 'pandas.core.frame.DataFrame'>`, confirming that the data is stored in the standard DataFrame structure of Pandas.

- **Row Index:** `RangeIndex: 289 entries, 0 to 288` indicates that the DataFrame contains 289 rows, starting at index 0 and ending at index 288. This gives an idea of the data volume.

- **Columns:** There are `1147` columns in the DataFrame. The first column is `Province/State`, and the last is the date column `3/9/23`. Most of the columns are therefore dates, which means the time series of confirmed COVID-19 cases is stored in wide format, with one column per day.

- **Data Types:**
  - `float64(2)`: There are 2 columns with float data types (decimal numbers). These are the coordinate columns `Lat` and `Long`.
  - `int64(1143)`: The majority of the columns, 1143 in total, are of the integer type. This matches one column per day from `1/22/20` to `3/9/23` and is consistent with counts of confirmed cases.
  - `object(2)`: There are 2 columns categorized as 'object', which typically hold strings or mixed data. Here they are `Province/State` and `Country/Region`.

- **Memory Usage:** `memory usage: 2.5+ MB` indicates that the DataFrame occupies at least 2.5 MB in memory. The `+` sign means the figure is a lower bound, because the contents of `object` columns are not measured in full by default. This metric is useful when deciding whether the data can be processed comfortably in memory.
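
If the per-column detail or an exact memory figure is needed, `info()` accepts parameters for both. The sketch below uses a small synthetic DataFrame (`demo_df`, with illustrative values rather than real data) so that it runs on its own; on the real data the same call would be made on `covid_df`.

```python
import io

import pandas as pd

# Small synthetic stand-in for covid_df; values are illustrative only.
demo_df = pd.DataFrame(
    {
        "Province/State": [None, "Alberta"],
        "Country/Region": ["Afghanistan", "Canada"],
        "Lat": [33.93911, 53.9333],
        "Long": [67.709953, -116.5765],
        "1/22/20": [0, 0],
        "1/23/20": [0, 0],
    }
)

# verbose=True forces the per-column listing, show_counts=True adds non-null
# counts, and memory_usage="deep" measures object columns in full.
buffer = io.StringIO()
demo_df.info(verbose=True, show_counts=True, memory_usage="deep", buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

On a DataFrame with more than a thousand columns the verbose listing is long, so it is usually more practical to inspect a subset of columns this way.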





### Viewing the First Rows of the COVID-19 DataFrame

- `covid_df.head()`:
  - This method is used to display the first five rows of the `covid_df` DataFrame, which contains the COVID-19 data.
  - It is a common practice in data exploration to get a quick understanding of the data format, the included columns, and the style of the recorded data.
  - Viewing the first few rows helps to confirm that the data has been loaded correctly and provides a preliminary view of the data structure, including column names, data types, and potential patterns or inconsistencies that might require attention in data cleaning and processing.

The output of this command will be crucial for our initial data processing decisions, allowing us to adequately plan the next stages of cleaning, transforming, and analyzing the data.



In [4]:
covid_df.head()

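|   | Province/State | Country/Region | Lat | Long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | ... | 2/28/23 | 3/1/23 | 3/2/23 | 3/3/23 | 3/4/23 | 3/5/23 | 3/6/23 | 3/7/23 | 3/8/23 | 3/9/23 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | Afghanistan | 33.93911 | 67.709953 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 209322 | 209340 | 209358 | 209362 | 209369 | 209390 | 209406 | 209436 | 209451 | 209451 |
| 1 | NaN | Albania | 41.1533 | 20.1683 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 334391 | 334408 | 334408 | 334427 | 334427 | 334427 | 334427 | 334427 | 334443 | 334457 |
| 2 | NaN | Algeria | 28.0339 | 1.6596 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 271441 | 271448 | 271463 | 271469 | 271469 | 271477 | 271477 | 271490 | 271494 | 271496 |
| 3 | NaN | Andorra | 42.5063 | 1.5218 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 47866 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47875 | 47890 | 47890 |
| 4 | NaN | Angola | -11.2027 | 17.8739 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 105255 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105277 | 105288 | 105288 |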


### Output of `covid_df.head()`

The output from `covid_df.head()` shows the first five rows of our COVID-19 dataset:

- **Column Descriptions:**
  - `Province/State`: This column contains the names of provinces or states. It is missing (NaN) in all five rows shown, which indicates that the data is reported at the country level for these entries.
  - `Country/Region`: The country or region to which the data row corresponds.
  - `Lat` and `Long`: These columns represent the latitude and longitude coordinates of the country or region.
  - Date Columns (`1/22/20`, `1/23/20`, ..., `3/9/23`): Each of these columns holds the number of confirmed COVID-19 cases recorded up to that date. The counts are cumulative: in the rows shown, the values never decrease from one day to the next. The time series runs from January 22, 2020, to March 9, 2023.

- **Row Examples:**
  - The first row corresponds to Afghanistan, with latitude and longitude values and a time series of confirmed cases from January 22, 2020, to March 9, 2023.
  - Similar patterns are observed for Albania, Algeria, Andorra, and Angola, with the progression of confirmed cases over time.

- **Initial Observations:**
  - The dataset is comprehensive, covering a wide date range and including many countries.
  - The presence of NaN values in the `Province/State` column might require attention, depending on the analysis's goals.
  - The data is primarily integer counts of confirmed cases, with geographical coordinates provided for each country or region.
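
The observations above can be checked programmatically rather than by eye. The following is a minimal sketch on a synthetic DataFrame (`demo_df`, illustrative values only); on the real data the same expressions would be applied to `covid_df`, and the assumption that the date columns start at the third position holds only for this small example.

```python
import pandas as pd

# Synthetic stand-in for covid_df; values are illustrative only.
demo_df = pd.DataFrame(
    {
        "Province/State": [None, "Alberta", "Ontario"],
        "Country/Region": ["Angola", "Canada", "Canada"],
        "1/22/20": [0, 0, 0],
        "1/23/20": [0, 1, 2],
        "1/24/20": [1, 1, 5],
    }
)

# How many rows have no province or state?
missing_provinces = int(demo_df["Province/State"].isna().sum())

# How many rows does each country contribute?
rows_per_country = demo_df["Country/Region"].value_counts()

# Are the counts cumulative, i.e. never decreasing from one day to the next?
date_columns = demo_df.columns[2:]
daily_change = demo_df[date_columns].diff(axis=1).iloc[:, 1:]
is_cumulative = bool((daily_change >= 0).all().all())

print(missing_provinces, rows_per_country.to_dict(), is_cumulative)
```

A country that appears more than once in `rows_per_country` is reported at the province or state level, which matters when country-level totals are needed later.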




## Process data


The following block of code represents a series of data transformation operations applied to the original `covid_df` DataFrame to create a new `processed_df` DataFrame, which is more structured and prepared for analysis:

- `processed_df = (`
  - We are defining `processed_df` as the result of a chain of methods applied to `covid_df`.

- `.select_columns(["Country/Region", "*/*/*"])`
  - We use the `select_columns` method to keep only specific columns of the DataFrame. We keep the `Country/Region` column plus every column whose name matches the shell-style pattern `*/*/*`, that is, all the date columns of the time series. The `Province/State`, `Lat`, and `Long` columns are dropped.

- `.pivot_longer(index="Country/Region", names_to="date")`
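  - We apply `pivot_longer` to transform the DataFrame from a wide format to a long format. Each date becomes a row instead of a column, which facilitates time-series analysis. The `index` argument names the identifier column that is repeated on every row (`Country/Region`); it does not become the DataFrame's index. The former column names are stored in the new `date` column, and the case counts are stored in a column named `value`, the default name.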

- `.transform_column("date", pd.to_datetime)`
  - We transform the `date` column to a date and time format using `pd.to_datetime`. This is essential for later manipulations and analyses that require date-based operations.

- `.clean_names()`
  - Finally, `clean_names` normalizes the column names by converting them to lowercase and replacing spaces and special characters with underscores, so `Country/Region` becomes `country_region`.



- `processed_df.head()`
  - This method will show us the first five rows of the `processed_df` DataFrame, providing a quick view of the result of the transformation operations applied.

This transformation process is crucial for preparing the data for more complex and efficient analyses, as it facilitates the handling of time series and ensures that the data is in a suitable and consistent format.
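
For comparison, the same reshaping can be approximated with plain Pandas, which may help readers who are not familiar with `pyjanitor`. This is a sketch of an equivalent, not the notebook's own method: it swaps `select_columns`, `pivot_longer`, `transform_column`, and `clean_names` for column selection, `melt`, `assign`, and `rename`. The DataFrame `demo_df` is synthetic, and the explicit `format="%m/%d/%y"` is an assumption based on the column names seen earlier.

```python
import pandas as pd

# Synthetic stand-in for covid_df; values are illustrative only.
demo_df = pd.DataFrame(
    {
        "Province/State": [None, None],
        "Country/Region": ["Afghanistan", "Albania"],
        "Lat": [33.93911, 41.1533],
        "Long": [67.709953, 20.1683],
        "1/22/20": [0, 0],
        "1/23/20": [0, 0],
    }
)

# Date columns are the only ones whose names contain two slashes.
date_columns = [col for col in demo_df.columns if col.count("/") == 2]

demo_long = (
    demo_df[["Country/Region", *date_columns]]
    # Wide to long: one row per country and date.
    .melt(id_vars=["Country/Region"], var_name="date", value_name="value")
    # Parse the month/day/two-digit-year strings into timestamps.
    .assign(date=lambda df: pd.to_datetime(df["date"], format="%m/%d/%y"))
    # Mimic the renaming that clean_names performs for this one column.
    .rename(columns={"Country/Region": "country_region"})
)

print(demo_long)
```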


In [5]:
processed_df = (
    covid_df
    .select_columns(["Country/Region", "*/*/*"])
    .pivot_longer(
        index="Country/Region",
        names_to="date"
    )
    .transform_column("date", pd.to_datetime)
    .clean_names()
)

processed_df.head()

|   | country_region | date | value |
|---|---|---|---|
| 0 | Afghanistan | 2020-01-22 | 0 |
| 1 | Albania | 2020-01-22 | 0 |
| 2 | Algeria | 2020-01-22 | 0 |
| 3 | Andorra | 2020-01-22 | 0 |
| 4 | Angola | 2020-01-22 | 0 |


### Output of `processed_df.head()`

The output from `processed_df.head()` presents the first five rows of our transformed DataFrame `processed_df`, which now reflects the structured format suitable for detailed analysis:

- **Columns Description:**
  - `country_region`: This column lists the country or region names. The transformation has standardized the column name for clarity and consistency.
  - `date`: Represents the date for the corresponding data entry. The transformation process has converted this column to a proper date format, which is evident from the standardized date entries (e.g., `2020-01-22`).
  - `value`: This column shows the number of confirmed COVID-19 cases. The pivot operation has transformed the original wide format (where each date was a separate column) into this long format, placing the count of confirmed cases in a single column.

- **First Five Rows:**
  - The rows display data for `Afghanistan`, `Albania`, `Algeria`, `Andorra`, and `Angola` for the date `2020-01-22`.
  - Each row corresponds to the confirmed COVID-19 case count for each country on that date, which, in these cases, is `0`.

- **Initial Observations:**
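  - The dataset is now in long format: each row holds one value for one date and one `country_region` label, which makes time-series analysis easier. Because `Province/State` was dropped, any country that the raw file reports at the province or state level contributes several rows per date, so country-level totals may require aggregating `value` by `country_region` and `date`.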
  - The cleanliness and structure of the dataset have been significantly improved, with clear, consistent column names and well-organized data.

This format of the data is highly beneficial for subsequent analyses, as it allows for more straightforward manipulation and analysis, particularly when dealing with time-series data across different countries or regions.
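
One manipulation that the long format makes simple is aggregating to one row per country and date. The sketch below is not part of the original pipeline; it uses a synthetic `demo_long` DataFrame with illustrative values, and on the real data the same `groupby` would be applied to `processed_df`.

```python
import pandas as pd

# Synthetic long-format data; two rows share the same country and date,
# as happens when a country is reported by province or state.
demo_long = pd.DataFrame(
    {
        "country_region": ["Canada", "Canada", "Angola"],
        "date": pd.to_datetime(["2020-01-22", "2020-01-22", "2020-01-22"]),
        "value": [1, 2, 0],
    }
)

# Sum the counts so that each (country, date) pair appears exactly once.
country_totals = (
    demo_long
    .groupby(["country_region", "date"], as_index=False)["value"]
    .sum()
)

print(country_totals)
```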


## Save output data



- `output_covid_file = path.data_processed_dir("time_series_covid19_confirmed_global_processed.csv")`:
  - In this line, we are defining the variable `output_covid_file`, which will store the output path for the processed COVID-19 data file.
  - We use the `data_processed_dir` function from the `path` module to obtain the path to the specific directory intended for storing processed data within the project's structure.
  - The file name `"time_series_covid19_confirmed_global_processed.csv"` indicates that we will save the processed data in a CSV file.

Specifying this output path is an important step in organizing and managing data files within the project, ensuring that the processed data is easily accessible and well-organized for use in subsequent analyses.


In [6]:
output_covid_file = path.data_processed_dir("time_series_covid19_confirmed_global_processed.csv")

### Exporting the Processed Data to a CSV File

- `processed_df.to_csv(output_covid_file, index=False)`:
  - This line of code performs the final action of exporting the `processed_df` DataFrame to a CSV file.
  - We use the `to_csv` method from Pandas, which is an efficient and straightforward way to save DataFrames in CSV format.
  - `output_covid_file` is used as the file path argument, indicating where the CSV file will be saved. This is the path we defined earlier, ensuring that the data is stored in the correct location within our project's structure.
  - The parameter `index=False` is included to indicate that we do not want to save the DataFrame's index in the CSV file. This is commonly preferred to keep data files clean and focused exclusively on the data, without additional index columns.




In [7]:
processed_df.to_csv(output_covid_file, index=False)

This final step completes the data handling process in this notebook, ensuring that the processed data is available for immediate use in subsequent analyses or for sharing with other stakeholders of the project.
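
When the processed file is read back in a later notebook, note that CSV stores dates as plain text, so the `date` column has to be parsed again on load. The sketch below shows a round trip on synthetic data written to a temporary directory; the file name `demo_processed.csv` is hypothetical, and in the project the path would come from `path.data_processed_dir`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Synthetic long-format data; values are illustrative only.
demo_long = pd.DataFrame(
    {
        "country_region": ["Afghanistan", "Albania"],
        "date": pd.to_datetime(["2020-01-22", "2020-01-22"]),
        "value": [0, 0],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo_processed.csv"  # hypothetical file name
    demo_long.to_csv(demo_file, index=False)
    # parse_dates restores the datetime type that CSV cannot store.
    reloaded = pd.read_csv(demo_file, parse_dates=["date"])

print(reloaded.dtypes)
```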