# Cleaning data

## Reading Data

- Reading data is the first step in any data analysis task.
- Pandas offers versatile functions to read data from various file formats.
  
  **Example for reading from an Excel file**:
  ```python
  import pandas as pd
  # Read the worksheet named 'Sheet1' into a DataFrame and display it
  df = pd.read_excel('input.xlsx', sheet_name='Sheet1')
  df
  ```

  **Example for reading from a CSV file**:
  ```python
  df = pd.read_csv('data/raw/input.csv')
  df
  ```

- **File Formats**:
  - Pandas provides functions to read from a wide range of sources, including:
    - Text formats such as CSV, JSON, and HTML. Use `pd.read_csv()`, `pd.read_json()`, and `pd.read_html()` respectively.
    - Binary formats such as Excel, HDF5, and Parquet. Use `pd.read_excel()`, `pd.read_hdf()`, and `pd.read_parquet()` respectively.
    - SQL databases such as SQLite, PostgreSQL, and MySQL. First establish a connection to the database, then fetch the data with `pd.read_sql_query()` or `pd.read_sql_table()`.
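
As a minimal sketch of the SQL route, the example below builds a throwaway in-memory SQLite database and reads a query result into a DataFrame. The table name `measurements` and its rows are invented for illustration; with a real database you would connect to your own file or server instead.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database so the example is self-contained.
# The table name and its rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO measurements (id, value) VALUES (?, ?)",
    [(1, 2.5), (2, 3.0), (3, 4.5)],
)
conn.commit()

# Run a query and load the result straight into a DataFrame
df_sql = pd.read_sql_query(
    "SELECT id, value FROM measurements WHERE value > 2.7", conn
)
conn.close()

print(df_sql)
```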

Remember that you must specify the correct filepath when reading data. The filepath is the location of the file on your computer. If you are using Google Colab, you need to:
1. Mount your Google Drive.
2. Set your working directory.
3. Specify the filepath to the data on your Google Drive.
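
The three Colab steps can be sketched roughly as follows. The project folder `my_project` is a hypothetical name, and the code falls back to the current working directory when it is not running in Colab, so treat it as a starting point, not a fixed recipe.

```python
import os
from pathlib import Path

try:
    # This module is only available inside Google Colab
    from google.colab import drive
    in_colab = True
except ImportError:
    in_colab = False

if in_colab:
    # 1. Mount Google Drive (Colab asks for permission)
    drive.mount('/content/drive')
    # 2. Set the working directory; 'my_project' is a hypothetical folder name
    base_dir = Path('/content/drive/MyDrive/my_project')
    os.chdir(base_dir)
else:
    # Outside Colab, work relative to the current directory
    base_dir = Path.cwd()

# 3. Build the filepath to the data relative to the base folder
data_path = base_dir / 'data' / 'raw' / 'input.csv'
print(data_path, data_path.exists())
```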

----

## Tutorial 3, Part 1: Reading Data Using Pandas

Import the possum dataset as a dataframe and call it `df_pos`. Let's answer the following questions together:
- What does the data look like? Print the first five rows of the dataframe: `print(df_pos.head(5))`
- What are the column names? `columns`
- What further information is available about the data? `info()`
- What are the data types of the columns? `dtypes`
- What is the shape (number of rows and columns) of the dataframe? `shape`
- What are the summary statistics of the dataframe? `describe()`
- What is the average age of the possums? `print(df_pos[''].mean())`
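
To show how these inspection tools behave, here is a small sketch on a made-up dataframe. The name `df_demo`, its columns, and its values are invented stand-ins, not the real possum data.

```python
import pandas as pd

# Hypothetical stand-in data: column names and values are invented
df_demo = pd.DataFrame({
    "site": [1, 1, 2, 2],
    "sex": ["m", "f", "f", "m"],
    "age": [8.0, 6.0, 2.0, 4.0],
})

print(df_demo.head(5))            # first rows (all four here)
print(df_demo.columns.tolist())   # column names as a list
df_demo.info()                    # non-null counts, dtypes, memory usage
print(df_demo.dtypes)             # data type of each column
print(df_demo.shape)              # (number of rows, number of columns)
print(df_demo.describe())         # summary statistics of numeric columns
print(df_demo["age"].mean())      # mean of a single column
```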

----


Then, continuing from the previous exercise:
- What is the average weight of the possums by sex?
- What is the median age of the possums by site?
- Rename the populations called `other` to `bison`
- Create a dummy variable called `old` that is 1 if the possum is older than 5 years and 0 otherwise
- Calculate the body length of the possums and add it as a new column to the dataframe
- Round weight to one decimal place and replace the weight column with the rounded values
- Create a new column in the possum dataframe called `BMI` that is the body mass index of the possums. The formula for BMI is:

    $$
    BMI = \frac{\text{weight in kg}}{(\text{height in m})^2}
    $$
    
- Wild card: Please make a variable that you think is interesting and add it to the dataframe
- Save your data to a new CSV file called `possum_data_cleaned.csv` in the interim data folder
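
Several of these steps follow common pandas patterns, sketched below on invented toy data. The column names (`pop`, `sex`, `site`, `age`, `weight`, `height`) and the units are assumptions and may not match the real possum dataset, so adapt them to your own columns.

```python
import pandas as pd

# Invented toy data; column names and units are assumptions
df_toy = pd.DataFrame({
    "pop": ["vic", "other", "other", "vic"],
    "sex": ["m", "f", "f", "m"],
    "site": [1, 2, 2, 1],
    "age": [8.0, 6.0, 2.0, 4.0],
    "weight": [2.84, 3.16, 2.27, 3.01],  # assumed to be in kg
    "height": [0.40, 0.42, 0.35, 0.41],  # assumed to be in m
})

# Group-wise summaries
mean_weight_by_sex = df_toy.groupby("sex")["weight"].mean()
median_age_by_site = df_toy.groupby("site")["age"].median()

# Rename a category value
df_toy["pop"] = df_toy["pop"].replace({"other": "bison"})

# Dummy variable: 1 if older than 5 years, 0 otherwise
df_toy["old"] = (df_toy["age"] > 5).astype(int)

# Round a column in place, then derive a new column from it
df_toy["weight"] = df_toy["weight"].round(1)
df_toy["BMI"] = df_toy["weight"] / df_toy["height"] ** 2

print(mean_weight_by_sex)
print(median_age_by_site)
print(df_toy)
```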

## Writing Data

- DataFrames can be saved to a variety of file formats.

  **Example for writing to an Excel file**:
  ```python
  # index=False leaves the row index out of the saved file
  df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
  ```

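- **File Formats**:
  - Pandas can write to a wide range of targets, including:
    - Text formats such as CSV, JSON, and HTML. Use `df.to_csv()`, `df.to_json()`, and `df.to_html()` respectively.
    - Binary formats such as Excel, HDF5, and Parquet. Use `df.to_excel()`, `df.to_hdf()`, and `df.to_parquet()` respectively.
    - SQL databases such as SQLite, PostgreSQL, and MySQL. First establish a connection to the database, then write the data with `df.to_sql()`.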
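
As a rough sketch of writing text formats, the example below saves a small invented dataframe to CSV and JSON and reads both files back. It uses a temporary directory so nothing is left on disk; in a real project you would write to your own output folder instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented example data
df_out = pd.DataFrame({"id": [1, 2, 3], "value": [2.5, 3.0, 4.5]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "output.csv"
    json_path = Path(tmp) / "output.json"

    # index=False leaves the row index out of the CSV file
    df_out.to_csv(csv_path, index=False)
    # orient="records" writes one JSON object per row
    df_out.to_json(json_path, orient="records")

    # Read the files back to check the round trip
    df_csv = pd.read_csv(csv_path)
    df_json = pd.read_json(json_path)

print(df_csv)
print(df_json)
```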