In [None]:
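import pandas as pd
import arff  # liac-arff package (pip install liac-arff), not scipy.io.arff

rain_dataset = pd.read_csv("rain.csv")

# scipy.io.arff.loadarff does not support ARFF string attributes such as
# full_name, so the salary file is read with liac-arff instead
with open("salaries.arff", "r") as f:
    salaries_dataset = arff.load(f)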


# Dataset Description Report

## Group Members
- [Member Name 1]
- [Member Name 2]
- [Member Name 3]

## Introduction
This report describes two datasets used for analysis: a weather dataset for predicting next-day rainfall in Australia and an employee salary dataset from Montgomery County, MD. For each dataset, it covers the characteristics, attributes, target attribute, and significance.


## Dataset 1: Weather Observations Dataset

### Context
The first dataset comprises daily weather observations collected over approximately ten years from various locations across Australia. This dataset enables the prediction of next-day rain, answering the crucial question of whether to carry an umbrella.

### Characteristics
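- **Samples**: Daily observations spanning approximately 10 years (3436 distinct dates) across 49 locations; the exact row count is available from `df_rain.shape` once the file is loaded.
- **Attributes**: 23 attributes (22 predictors plus the target `RainTomorrow`), including both numerical and categorical types.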

### Attribute Types and Unique Values
| Attribute          | Type     | Unique Values | Missing Values | Description                                                  |
|--------------------|----------|---------------|----------------|--------------------------------------------------------------|
| Date               | Object   | 3436          | 0              | The date of observation.                                     |
| Location           | Object   | 49            | 0              | The name of the weather station location.                   |
| MinTemp            | Float    | 389           | 1485           | Minimum temperature in degrees Celsius.                      |
| MaxTemp            | Float    | 505           | 1261           | Maximum temperature in degrees Celsius.                      |
| Rainfall           | Float    | 681           | 3261           | Amount of rainfall recorded for the day in mm.              |
| Evaporation        | Float    | 358           | 62790          | Class A pan evaporation (mm) in the 24 hours to 9am.       |
| Sunshine           | Float    | 145           | 69835          | Number of hours of bright sunshine in the day.              |
| WindGustDir        | Object   | 16            | 10326          | Direction of the strongest wind gust in the last 24 hours.  |
| WindGustSpeed      | Float    | 67            | 10263          | Speed (km/h) of the strongest wind gust in the last 24 hours. |
| WindDir9am         | Object   | 16            | 10566          | Direction of the wind at 9am.                               |
| WindDir3pm         | Object   | 16            | 4228           | Direction of the wind at 3pm.                               |
| WindSpeed9am       | Float    | 43            | 1767           | Wind speed (km/h) averaged over 10 minutes prior to 9am.   |
| WindSpeed3pm       | Float    | 44            | 3062           | Wind speed (km/h) averaged over 10 minutes prior to 3pm.   |
| Humidity9am        | Float    | 101           | 2654           | Humidity (percent) at 9am.                                  |
| Humidity3pm        | Float    | 101           | 4507           | Humidity (percent) at 3pm.                                  |
| Pressure9am        | Float    | 546           | 15065          | Atmospheric pressure (hPa) at 9am.                          |
| Pressure3pm        | Float    | 549           | 15028          | Atmospheric pressure (hPa) at 3pm.                          |
| Cloud9am           | Float    | 10            | 55888          | Fraction of sky obscured by cloud at 9am (in oktas).        |
| Cloud3pm           | Float    | 10            | 1767           | Fraction of sky obscured by cloud at 3pm (in oktas).        |
| Temp9am            | Float    | 441           | 3609           | Temperature (degrees C) at 9am.                             |
| Temp3pm            | Float    | 502           | 1485           | Temperature (degrees C) at 3pm.                             |
| RainToday          | Object   | 2             | 3261           | Yes if precipitation (mm) exceeds 1mm, otherwise No. |
| RainTomorrow       | Object   | 2             | 3267           | Indicates whether it will rain tomorrow (Yes/No).           |


### Comments on Reformatting
- **Date**: Should be reformatted to `datetime` for better handling in time series analysis.
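
The conversion suggested above could look like the following minimal sketch. The `sample` frame is a hypothetical stand-in for the `Date` column of `rain.csv`, and the `YYYY-MM-DD` format is an assumption that should be checked against the actual file.

```python
import pandas as pd

# Hypothetical sample standing in for the Date column of rain.csv
sample = pd.DataFrame({"Date": ["2008-12-01", "2008-12-02", "not a date"]})

# Assumed format: YYYY-MM-DD. errors="coerce" turns unparseable entries
# into NaT instead of raising an exception.
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d", errors="coerce")

# Datetime columns expose calendar fields through the .dt accessor
sample["Month"] = sample["Date"].dt.month

print(sample.dtypes)
```

Once parsed, the column can be used for sorting, resampling, or deriving seasonal features such as the month.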


### Target Attribute
- **RainTomorrow**: This binary attribute indicates whether it will rain the following day (Yes/No). It is crucial for predicting weather patterns and making informed decisions.
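
Because the target is a Yes/No label with some missing entries, one possible preparation step is to drop unlabeled rows and map the labels to integers. This is a sketch on a hypothetical `sample` frame; the column name `RainTomorrowFlag` is introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the RainTomorrow column
sample = pd.DataFrame({"RainTomorrow": ["No", "Yes", "No", None, "No"]})

# Rows without a label cannot be used for supervised training
labeled = sample.dropna(subset=["RainTomorrow"]).copy()

# Map the Yes/No labels to 1/0 (illustrative column name)
labeled["RainTomorrowFlag"] = labeled["RainTomorrow"].map({"No": 0, "Yes": 1})

# Share of each class, useful for spotting class imbalance
class_share = labeled["RainTomorrowFlag"].value_counts(normalize=True)
print(class_share)
```

Checking the class shares before modeling helps decide whether accuracy alone is a reasonable evaluation metric.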

### Importance of Dataset
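Understanding the distribution of values in each attribute guides preprocessing. Several attributes have many missing values (for example, 69835 for `Sunshine` and 62790 for `Evaporation`), so decisions about imputing values or dropping attributes directly affect feature selection and the accuracy of predictive models.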
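
The per-attribute type, unique-value, and missing-value counts tabulated above could be reproduced with a small helper along these lines. The function name `summarize_attributes` and the `sample` frame are hypothetical; in the notebook the helper would be applied to the loaded rain DataFrame.

```python
import pandas as pd


def summarize_attributes(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, unique count, and missing count."""
    return pd.DataFrame({
        "Type": df.dtypes.astype(str),
        "Unique Values": df.nunique(),      # nunique() ignores NaN by default
        "Missing Values": df.isna().sum(),
    })


# Hypothetical sample with one missing temperature
sample = pd.DataFrame({
    "Location": ["Albury", "Albury", "Sydney"],
    "MinTemp": [13.4, None, 7.4],
})

summary = summarize_attributes(sample)
print(summary)
```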

In [None]:
# Path to the CSV file with dataset
file_path = "rain.csv"

# Read the CSV file into a pandas DataFrame
# (pass the variable, not the string literal 'file_path')
df_rain = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
df_rain.head()

---

## Dataset 2: Employee Salary Dataset

### Context
The second dataset provides annual salary information, including gross and overtime pay, for all active, permanent employees of Montgomery County, MD, for the calendar year 2016. This data is essential for analyzing salary distribution and identifying trends in employee compensation.

### Characteristics
- **Samples**: 9222 employee records
- **Attributes**: 13 attributes with a mix of numeric, nominal, and string types.

### Attribute Types and Unique Values
| Attribute                    | Type     | Unique Values | Missing Values | Description                                                  |
|------------------------------|----------|---------------|----------------|--------------------------------------------------------------|
| full_name                    | String   | 9222          | 0              | Employee's full name.                                       |
| gender                       | Nominal  | 2             | 17             | Employee's gender (F, M).                                   |
| current_annual_salary        | Numeric  | 3403          | 0              | Employee's current annual salary in USD.                    |
| 2016_gross_pay_received      | Numeric  | 8977          | 100            | Total gross pay received in 2016.                           |
| 2016_overtime_pay            | Numeric  | 6176          | 2917           | Total overtime pay received in 2016.                        |
| department                   | Nominal  | 37            | 0              | Department of the employee.                                  |
| department_name              | Nominal  | 37            | 0              | Full name of the department.                                 |
| division                     | String   | 694           | 0              | Division of the employee within the department.             |
| assignment_category          | Nominal  | 2             | 0              | Employment category (Fulltime-Regular, Parttime-Regular).   |
| employee_position_title      | String   | 385           | 0              | Job title of the employee.                                   |
| underfilled_job_title        | String   | 84            | 0              | Job title of the position being underfilled, if applicable. |
| date_first_hired             | String   | 2264          | 0              | Date the employee was first hired.                           |
| year_first_hired             | Numeric  | 51            | 0              | Year the employee was first hired.                           |

### Target Attribute
- **current_annual_salary**: The primary target attribute used for salary analysis and modeling. Understanding its distribution is crucial for various analyses, including equity and budgeting.

### Comments on Reformatting
- **date_first_hired**: Stored as a string; it should be converted to `datetime` so that tenure and other time-based analyses can be computed directly.
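
A possible sketch of this conversion follows, with a consistency check against `year_first_hired`. The `sample` values and the `MM/DD/YYYY` format are assumptions; the actual format in `salaries.arff` should be inspected first.

```python
import pandas as pd

# Hypothetical rows standing in for the salary dataset
sample = pd.DataFrame({
    "date_first_hired": ["09/22/1986", "11/19/2006"],
    "year_first_hired": [1986, 2006],
})

# Assumed format: MM/DD/YYYY. Unparseable entries become NaT.
sample["date_first_hired"] = pd.to_datetime(
    sample["date_first_hired"], format="%m/%d/%Y", errors="coerce"
)

# Cross-check the parsed year against the separate year_first_hired attribute
year_matches = sample["date_first_hired"].dt.year == sample["year_first_hired"]
print(year_matches.all())
```

If the cross-check fails for many rows, the assumed format is likely wrong (for example, day and month swapped).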

### Importance of Dataset
The distribution of numeric values in the salary dataset provides insight into compensation trends, while the categorical data (such as gender and department) allows for analyzing disparities and ensuring equitable pay practices.
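
As an illustration of the kind of disparity analysis described above, the sketch below compares salary by gender on hypothetical data. The values are invented for the example and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows; the real analysis would use the loaded salary DataFrame
sample = pd.DataFrame({
    "gender": ["F", "M", "F", "M", None],
    "current_annual_salary": [60000.0, 50000.0, 70000.0, 90000.0, 55000.0],
})

# groupby drops rows whose key is missing (dropna=True by default)
by_gender = sample.groupby("gender")["current_annual_salary"].agg(["count", "median"])
print(by_gender)
```

The median is used rather than the mean because salary distributions are often skewed by a few high earners. A fuller analysis would also control for department and position.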


In [None]:
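# liac-arff installs as the top-level module `arff`; the alias keeps it
# distinct from scipy.io.arff, which has no load() function
import arff as liac_arff

# Path to the ARFF file with the dataset
file_path = 'salaries.arff'  # Adjust the path if needed

# Load the ARFF file using liac-arff
with open(file_path, 'r') as file:
    dataset = liac_arff.load(file)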

# Extract data and column names (attributes) from the ARFF file
df = pd.DataFrame(dataset['data'], columns=[attr[0] for attr in dataset['attributes']])

# Display the first few rows of the DataFrame
df.head()

---

## Conclusion
Both datasets offer valuable insights into weather prediction and employee compensation. By understanding their characteristics, distributions, and the significance of attributes, we can apply appropriate data preprocessing techniques and build effective predictive models.