<a href="https://colab.research.google.com/github/natnew/Python-Projects-Collecting-and-Manipulating-Data/blob/main/Collecting_and_Manipulating_For_Weather_in_London.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

Weather is tracked and analyzed every day to help airplanes have safe flights. Many weather conditions must be monitored to keep the risk to the aircraft as low as possible. Can this same practice be applied to weather data and the health sector? (Microsoft 2021)

This project will look at:



* Conditions (cloudy, partly cloudy, fair, rain, thunder, heavy storm)
* Temperature
* Humidity
* Wind speed
* Wind direction
* Precipitation
* Visibility
* Sea level
* Pressure
* Most common health conditions

We will compare weather data in London (Heathrow) across three different years to see if there are any noticeable trends.


## Questions

"Has the weather changed in London over the past 70+ years?"

"Will the weather in this area at this time cause any potential health issues for the locals?"

## Collect Data

The data was collected from the Met Office website and contains a large amount of data from 1948 to 2021. For the purpose of this project, we will only compare data from 1948, 2000, and 2021.

You can find the data here: <br>
https://www.wunderground.com/history/daily/EGLC/date/2021-6-1 <br>

## Missing Data

The Excel files have extensive data about the weather in each year. However, as I start to explore this data, I might find a significant problem: there are instances where no data was captured.
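
A minimal sketch of how such gaps could be counted with pandas is shown here. The small `sample` table and its `Rain` and `Sun` column names are hypothetical stand-ins, not the actual contents of the Excel files.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a weather table; column names are illustrative only.
sample = pd.DataFrame({
    "Year": [1948, 1948, 1948],
    "Month": [1, 2, 3],
    "Rain": [85.0, np.nan, 14.0],
    "Sun": [np.nan, np.nan, np.nan],
})

# Count the missing entries in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# List the columns in which no value was captured at all.
empty_columns = missing_per_column[missing_per_column == len(sample)].index.tolist()
print(empty_columns)
```

Running the same two steps on a real DataFrame gives a quick picture of where data is missing before deciding how to clean it.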



# Import Libraries

In [1]:
# Pandas library is used for handling tabular data
import pandas as pd

# NumPy is used for handling numerical series operations (addition, multiplication, and ...)
import numpy as np

# Sklearn library contains all the machine learning packages we need to digest and extract patterns from the data
from sklearn import linear_model, model_selection, metrics
from sklearn.model_selection import train_test_split

# Machine learning libraries used to build a decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Sklearn's preprocessing library is used for processing and cleaning the data 
from sklearn import preprocessing

# for visualizing the tree
import pydotplus
from IPython.display import Image

# Read Data into a variable

In [3]:
weather_data_1948 = pd.read_excel('WeatherDataCompleted1948.xlsx')
weather_data_1948.head()

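|   | Year | Month | Column1.4 | Column1.5 | Column1.6 | Column1.7 | Column1.8 | Column1.9 | Column1.10 | Column2 | Column3 |
|---|------|-------|-----------|-----------|-----------|-----------|-----------|-----------|------------|---------|---------|
| 0 | 1948 | 1 | 8.0 | 9 | 3.0 | 3 --- | 85.0 | 0 --- | | | |
| 1 | 1948 | 2 | 7.0 | 9 | 2.0 | 2 --- | 26.0 | 0 --- | | | |
| 2 | 1948 | 3 | 14.0 | 2 | 3.0 | 8 --- | 14.0 | 0 --- | | | |
| 3 | 1948 | 4 | 15.0 | 4 | 5.0 | 1 --- | 35.0 | 0 --- | | | |
| 4 | 1948 | 5 | 18.0 | 1 | 6.0 | 9 --- | 57.0 | 0 --- | | | |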


As you can see from the data, we have some redundant columns which need to be removed. They ended up in the data via the data collection process. This will be discussed later.
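
One possible way to remove such columns is sketched here, under the assumption that the redundant columns are the ones containing no values at all. The `sample` frame is a hypothetical miniature with made-up rows, not the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature: two trailing columns that hold no data.
sample = pd.DataFrame({
    "Year": [1948, 1948],
    "Month": [1, 2],
    "Column1.4": [8.0, 7.0],
    "Column2": [np.nan, np.nan],
    "Column3": [np.nan, np.nan],
})

# Drop every column in which all values are missing.
cleaned = sample.dropna(axis="columns", how="all")
print(cleaned.columns.tolist())
```

Using `how="all"` keeps columns that are only partly empty, so no captured measurements are discarded by this step.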

# Explore Data

In [4]:
weather_data_1948.columns

Index(['Year', 'Month', 'Column1.4', 'Column1.5', 'Column1.6', 'Column1.7',
       'Column1.8', 'Column1.9', 'Column1.10', 'Column2', 'Column3'],
      dtype='object')

# Data Cleaning

In [5]:
weather_data_1948.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Year        887 non-null    int64  
 1   Month       887 non-null    int64  
 2   Column1.4   887 non-null    float64
 3   Column1.5   887 non-null    object 
 4   Column1.6   887 non-null    float64
 5   Column1.7   887 non-null    object 
 6   Column1.8   887 non-null    float64
 7   Column1.9   887 non-null    object 
 8   Column1.10  875 non-null    object 
 9   Column2     0 non-null      float64
 10  Column3     0 non-null      float64
dtypes: float64(5), int64(2), object(4)
memory usage: 76.4+ KB


Observation: <br>
* Column2 and Column3 have no data. 
* Most of the other columns report 887 non-null values (Column1.10 reports 875). That is more than we would expect for a single year, because one year should contribute only 12 monthly values per column.
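
A rough sketch of how the row count per year could be checked, and how the table could be narrowed to the years of interest, follows. The `sample` frame is a hypothetical miniature; only the `Year` and `Month` column names are taken from the output above.

```python
import pandas as pd

# Hypothetical miniature covering several years.
sample = pd.DataFrame({
    "Year": [1948, 1948, 1949, 2000, 2021],
    "Month": [1, 2, 1, 1, 1],
})

# Number of rows recorded for each year.
rows_per_year = sample.groupby("Year").size()
print(rows_per_year)

# Keep only the rows for a single year.
only_1948 = sample[sample["Year"] == 1948]

# Keep only the three years compared in this project.
selected = sample[sample["Year"].isin([1948, 2000, 2021])]
print(selected)
```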

Ways that data will be cleaned: <br>
* For missing information, mark it as unknown.
* For missing condition data, assume it was a typical day and use fair.
* For any other data, use a value of 0.
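
The three rules above could be sketched roughly as follows. The `Notes`, `Conditions`, and `Temperature` columns are hypothetical names chosen for illustration, and "any other data" is assumed here to mean the numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature with one gap of each kind.
sample = pd.DataFrame({
    "Notes": ["clear skies", None],
    "Conditions": [None, "rain"],
    "Temperature": [8.0, np.nan],
})

# Rules 1 and 2: text columns get "unknown" or "fair".
filled = sample.fillna({"Notes": "unknown", "Conditions": "fair"})

# Rule 3: remaining numeric gaps become 0.
numeric_columns = filled.select_dtypes(include="number").columns
filled[numeric_columns] = filled[numeric_columns].fillna(0)
print(filled)
```

Filling the text columns first and the numeric columns second avoids writing a 0 into a column that is meant to hold labels.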

In [6]:
## To handle missing values, we fill them with appropriate values. We do not need to do this for this dataset at the moment.
## Assigning the result back is used instead of inplace=True on a selected column, which may not update the DataFrame.
#weather_data_1948['ColumnName'] = weather_data_1948['ColumnName'].fillna('N')
#weather_data_1948['ColumnName'] = weather_data_1948['ColumnName'].fillna('Uncrewed')
#weather_data_1948['ColumnName'] = weather_data_1948['ColumnName'].fillna('unknown')
#weather_data_1948['ColumnName'] = weather_data_1948['ColumnName'].fillna('Fair')
#weather_data_1948 = weather_data_1948.fillna(0)
#weather_data_1948.head()

In [7]:
weather_data_1948.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 887 entries, 0 to 886
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Year        887 non-null    int64  
 1   Month       887 non-null    int64  
 2   Column1.4   887 non-null    float64
 3   Column1.5   887 non-null    object 
 4   Column1.6   887 non-null    float64
 5   Column1.7   887 non-null    object 
 6   Column1.8   887 non-null    float64
 7   Column1.9   887 non-null    object 
 8   Column1.10  875 non-null    object 
 9   Column2     0 non-null      float64
 10  Column3     0 non-null      float64
dtypes: float64(5), int64(2), object(4)
memory usage: 76.4+ KB


Observation: We now have a clearer dataset to work with.

# Data Manipulation

In [8]:
## As part of the data cleaning process, we have to convert text data to numerical because computers understand only numbers.
## We do not have any text values in this dataset, so we can skip this part.
#label_encoder = preprocessing.LabelEncoder()

# Three columns have categorical text info, and we convert them to numbers
#weather_data_1948['ColumnName'] = label_encoder.fit_transform(weather_data_1948['ColumnName'])
#weather_data_1948['ColumnName'] = label_encoder.fit_transform(weather_data_1948['ColumnName'])
#weather_data_1948['ColumnName'] = label_encoder.fit_transform(weather_data_1948['ColumnName'])
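
For reference, a minimal sketch of label encoding on a small table follows. The `Conditions` column and its values are hypothetical, borrowed from the condition names listed in the introduction, and the `ConditionsCode` column name is made up for the example.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical text column holding weather conditions.
sample = pd.DataFrame({"Conditions": ["fair", "rain", "cloudy", "fair"]})

# LabelEncoder assigns one integer per distinct label, in sorted label order.
label_encoder = preprocessing.LabelEncoder()
sample["ConditionsCode"] = label_encoder.fit_transform(sample["Conditions"])

print(sample)
print(list(label_encoder.classes_))
```

The integer codes carry no ordering meaning; they only give each label a numeric identity that a model can consume.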

In [9]:
weather_data_1948.head()

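|   | Year | Month | Column1.4 | Column1.5 | Column1.6 | Column1.7 | Column1.8 | Column1.9 | Column1.10 | Column2 | Column3 |
|---|------|-------|-----------|-----------|-----------|-----------|-----------|-----------|------------|---------|---------|
| 0 | 1948 | 1 | 8.0 | 9 | 3.0 | 3 --- | 85.0 | 0 --- | | | |
| 1 | 1948 | 2 | 7.0 | 9 | 2.0 | 2 --- | 26.0 | 0 --- | | | |
| 2 | 1948 | 3 | 14.0 | 2 | 3.0 | 8 --- | 14.0 | 0 --- | | | |
| 3 | 1948 | 4 | 15.0 | 4 | 5.0 | 1 --- | 35.0 | 0 --- | | | |
| 4 | 1948 | 5 | 18.0 | 1 | 6.0 | 9 --- | 57.0 | 0 --- | | | |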


Observation: We have data in a format that can be explored, manipulated and presented.

# Further Exploration

Ways that the data exploration journey can be extended include:<br>
* Explore the data further: Look up articles and reports on weather data/climate change and the impacts on health.
* Explore the missing weather data: Beyond individual days, were there seasons that showed a dramatic change in values? What kind of weather profile do those seasons tend to have?
* Explore other data manipulations: Could we have used better values to fill in missing data?
* Evaluate similar problems: Are there similar problems that you can use to help fill in this data? For example, are airplane delays because of weather in the area also an indicator?
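
As one illustration of the question about better fill values, the sketch below compares three options on a short series of made-up temperatures: filling with 0, filling with the mean, and linear interpolation between neighbouring readings. Which option suits the real data is left open.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly temperatures with two gaps.
temperature = pd.Series([8.0, np.nan, 14.0, np.nan, 18.0])

# Option 1: replace gaps with 0.
zero_filled = temperature.fillna(0)

# Option 2: replace gaps with the mean of the observed values.
mean_filled = temperature.fillna(temperature.mean())

# Option 3: estimate gaps linearly from the neighbouring readings.
interpolated = temperature.interpolate()

print(pd.DataFrame({
    "zero": zero_filled,
    "mean": mean_filled,
    "interpolated": interpolated,
}))
```

For a quantity like temperature, a 0 can be mistaken for a real reading, which is one reason the mean or an interpolated value may be worth considering.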

# Next Step

The next step of the process is to repeat the same steps for the 2000 and 2021 datasets.

Tip: Refer to the notebook about rocket launches for assistance.