## Data collection and manipulation

Import Python libraries and rocket launch data

The goal is to answer one question: is a launch likely to happen under specific weather conditions? The data set contains weather data from:

- Several successful launches
- One postponed launch
- The days leading up to and following each launch

Tools:
- pandas, jupyter, seaborn, scikit-learn, keras, tensorflow

Set up the environment:
- conda create -n devintrods python=3.10 pandas jupyter seaborn scikit-learn keras tensorflow

Activate the environment and install the Azure AutoML package:
- conda activate devintrods
- pip install -U azureml-train-automl

VS Code extensions:
- Azure Account
- Azure Machine Learning

In [17]:
import pandas as pd
import numpy as np

# scikit-learn provides the machine learning tools we need to find patterns in the data
from sklearn import linear_model, model_selection, metrics
from sklearn.model_selection import train_test_split

# Machine learning libraries used to build a decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Sklearn's preprocessing library is used for processing and cleaning the data 
from sklearn import preprocessing

# for visualizing the tree
import pydotplus
from IPython.display import Image 


Read data into a variable

In [19]:
launch_data = pd.read_excel('data/RocketLaunchDataCompleted.xlsx')
launch_data.head()

Unnamed: 0,Name,Date,Time (East Coast),Location,Crewed or Uncrewed,Launched?,High Temp,Low Temp,Ave Temp,Temp at Launch Time,...,Max Wind Speed,Visibility,Wind Speed at Launch Time,Hist Ave Max Wind Speed,Hist Ave Visibility,Sea Level Pressure,Hist Ave Sea Level Pressure,Day Length,Condition,Notes
0,,1958-12-04,,Cape Canaveral,,,75.0,68.0,71.0,,...,16.0,15.0,,,,30.22,,10:26:00,Cloudy,
1,,1958-12-05,,Cape Canaveral,,,78.0,70.0,73.39,,...,14.0,10.0,,,,30.2,,10:26:00,Cloudy,
2,Pioneer 3,1958-12-06,01:45:00,Cape Canaveral,Uncrewed,Y,73.0,0.0,60.21,62.0,...,15.0,10.0,11.0,,,30.25,,10:25:00,Cloudy,
3,,1958-12-07,,Cape Canaveral,,,76.0,57.0,66.04,,...,10.0,10.0,,,,30.28,,10:25:00,Partly Cloudy,
4,,1958-12-08,,Cape Canaveral,,,79.0,60.0,70.52,,...,12.0,10.0,,,,30.23,,12:24:00,Partly Cloudy,


Begin exploring data

In [20]:
launch_data.describe()

Unnamed: 0,High Temp,Low Temp,Ave Temp,Temp at Launch Time,Hist High Temp,Hist Low Temp,Hist Ave Temp,Percipitation at Launch Time,Hist Ave Percipitation,Max Wind Speed,Visibility,Wind Speed at Launch Time,Hist Ave Max Wind Speed,Hist Ave Visibility,Hist Ave Sea Level Pressure
count,299.0,299.0,299.0,59.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,59.0,0.0,0.0,0.0
mean,81.394649,38.745819,69.747124,75.101695,81.852843,62.87291,72.3899,0.063043,0.413478,16.842809,12.929766,10.59322,,,
std,9.0267,33.42309,10.867407,10.471134,6.860432,8.806109,7.825282,0.211995,4.676693,4.70171,6.044445,4.672438,,,
min,51.0,0.0,29.04,50.0,71.0,49.0,60.0,0.0,0.06,8.0,7.0,2.0,,,
25%,77.0,0.0,63.05,70.0,75.0,55.0,65.0,0.0,0.08,14.0,10.0,7.0,,,
50%,82.0,51.0,71.61,77.0,82.0,64.0,72.0,0.0,0.11,16.0,10.0,10.0,,,
75%,88.0,72.0,78.53,81.5,88.0,73.0,80.0,0.0,0.2,18.0,15.0,12.5,,,
max,99.0,83.0,90.79,98.0,91.0,79.0,82.0,1.8,81.0,60.0,80.0,26.0,,,


In [21]:
launch_data.columns

Index(['Name', 'Date', 'Time (East Coast)', 'Location', 'Crewed or Uncrewed',
       'Launched?', 'High Temp', 'Low Temp', 'Ave Temp', 'Temp at Launch Time',
       'Hist High Temp', 'Hist Low Temp', 'Hist Ave Temp',
       'Percipitation at Launch Time', 'Hist Ave Percipitation',
       'Wind Direction', 'Max Wind Speed', 'Visibility',
       'Wind Speed at Launch Time', 'Hist Ave Max Wind Speed',
       'Hist Ave Visibility', 'Sea Level Pressure',
       'Hist Ave Sea Level Pressure', 'Day Length', 'Condition', 'Notes'],
      dtype='object')

Exercise - Clean weather data to analyze rocket launch criteria

## Data cleaning
The first step in cleaning the data is to replace every missing value with a sensible default.
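
Before filling anything, it can help to count how many values are missing in each column. The sketch below uses a small, made-up DataFrame (the name `sample` and its values are illustrative assumptions, not rows from the launch file) to show one way of doing that with `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the launch table; the values are illustrative only.
sample = pd.DataFrame({
    "Name": [None, "Pioneer 3", None],
    "High Temp": [75.0, 73.0, np.nan],
    "Hist Ave Visibility": [np.nan, np.nan, np.nan],
})

# Number of missing entries in each column.
missing_counts = sample.isna().sum()

# Columns in which every row is missing carry no information.
empty_columns = missing_counts[missing_counts == len(sample)].index.tolist()

print(missing_counts)
print(empty_columns)
```

Running the same two expressions on `launch_data` should give a per-column view of the gaps that `info()` reports as non-null counts.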

In [22]:
# Get overview of data
launch_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 26 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Name                          60 non-null     object        
 1   Date                          300 non-null    datetime64[ns]
 2   Time (East Coast)             59 non-null     object        
 3   Location                      300 non-null    object        
 4   Crewed or Uncrewed            60 non-null     object        
 5   Launched?                     60 non-null     object        
 6   High Temp                     299 non-null    float64       
 7   Low Temp                      299 non-null    float64       
 8   Ave Temp                      299 non-null    float64       
 9   Temp at Launch Time           59 non-null     float64       
 10  Hist High Temp                299 non-null    float64       
 11  Hist Low Temp                 29

### Here are a few ways we'll clean the data:

- Rows without Y in the Launched? column had no rocket launch, so fill those missing values with N.
- For rows missing information on whether the rocket was crewed or uncrewed, assume uncrewed. Uncrewed is more likely because there were fewer crewed missions.
- For missing wind direction, mark it as unknown.
- For missing condition data, assume it was a typical day and use Fair.
- For any other missing data, use a value of 0.

In [23]:
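# Assign the filled columns back. Calling fillna(inplace=True) on a selected column
# is chained assignment and may not update launch_data in recent pandas versions.
launch_data['Launched?'] = launch_data['Launched?'].fillna('N')
launch_data['Crewed or Uncrewed'] = launch_data['Crewed or Uncrewed'].fillna('Uncrewed')
launch_data['Wind Direction'] = launch_data['Wind Direction'].fillna('unknown')
launch_data['Condition'] = launch_data['Condition'].fillna('Fair')
launch_data = launch_data.fillna(0)
launch_data.head()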


Unnamed: 0,Name,Date,Time (East Coast),Location,Crewed or Uncrewed,Launched?,High Temp,Low Temp,Ave Temp,Temp at Launch Time,...,Max Wind Speed,Visibility,Wind Speed at Launch Time,Hist Ave Max Wind Speed,Hist Ave Visibility,Sea Level Pressure,Hist Ave Sea Level Pressure,Day Length,Condition,Notes
0,0,1958-12-04,0,Cape Canaveral,Uncrewed,N,75.0,68.0,71.0,0.0,...,16.0,15.0,0.0,0.0,0.0,30.22,0.0,10:26:00,Cloudy,0
1,0,1958-12-05,0,Cape Canaveral,Uncrewed,N,78.0,70.0,73.39,0.0,...,14.0,10.0,0.0,0.0,0.0,30.2,0.0,10:26:00,Cloudy,0
2,Pioneer 3,1958-12-06,01:45:00,Cape Canaveral,Uncrewed,Y,73.0,0.0,60.21,62.0,...,15.0,10.0,11.0,0.0,0.0,30.25,0.0,10:25:00,Cloudy,0
3,0,1958-12-07,0,Cape Canaveral,Uncrewed,N,76.0,57.0,66.04,0.0,...,10.0,10.0,0.0,0.0,0.0,30.28,0.0,10:25:00,Partly Cloudy,0
4,0,1958-12-08,0,Cape Canaveral,Uncrewed,N,79.0,60.0,70.52,0.0,...,12.0,10.0,0.0,0.0,0.0,30.23,0.0,12:24:00,Partly Cloudy,0
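
The same cleaning rules can also be written as a single dictionary of per-column defaults passed to `DataFrame.fillna`. This is a minimal sketch on a made-up DataFrame; the names `sample`, `defaults`, and `cleaned` and the sample values are assumptions for illustration, while the column names and default values follow the rules listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the gaps in the launch data.
sample = pd.DataFrame({
    "Launched?": ["Y", None, None],
    "Crewed or Uncrewed": ["Crewed", None, None],
    "Wind Direction": ["E", None, "NW"],
    "Condition": [None, "Cloudy", None],
    "Temp at Launch Time": [62.0, np.nan, np.nan],
})

# One default per text column, following the cleaning rules.
defaults = {
    "Launched?": "N",
    "Crewed or Uncrewed": "Uncrewed",
    "Wind Direction": "unknown",
    "Condition": "Fair",
}

# fillna returns a new DataFrame; the dictionary keys select the columns to fill.
cleaned = sample.fillna(value=defaults)

# Anything still missing falls back to 0.
cleaned = cleaned.fillna(0)

print(cleaned)
```

Keeping the defaults in one dictionary makes the rules easy to review and leaves the original DataFrame untouched.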


In [26]:
launch_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 26 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Name                          300 non-null    object        
 1   Date                          300 non-null    datetime64[ns]
 2   Time (East Coast)             300 non-null    object        
 3   Location                      300 non-null    object        
 4   Crewed or Uncrewed            300 non-null    object        
 5   Launched?                     300 non-null    object        
 6   High Temp                     300 non-null    float64       
 7   Low Temp                      300 non-null    float64       
 8   Ave Temp                      300 non-null    float64       
 9   Temp at Launch Time           300 non-null    float64       
 10  Hist High Temp                300 non-null    float64       
 11  Hist Low Temp                 30

### Data Manipulation

In [27]:
# scikit-learn models need numeric input, so convert the
# categorical text columns to integer codes
label_encoder = preprocessing.LabelEncoder()

# Three columns hold categorical text; each fit_transform call refits the encoder on that column
launch_data['Crewed or Uncrewed'] = label_encoder.fit_transform(launch_data['Crewed or Uncrewed'])
launch_data['Wind Direction'] = label_encoder.fit_transform(launch_data['Wind Direction'])
launch_data['Condition'] = label_encoder.fit_transform(launch_data['Condition'])
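
To see which number each label receives, the following sketch fits a `LabelEncoder` on a short, made-up list of conditions (the list and the names `encoder`, `codes`, and `mapping` are illustrative assumptions). The encoder sorts the distinct labels and uses each label's position in `classes_` as its code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical condition values, not taken from the launch file.
conditions = ["Cloudy", "Partly Cloudy", "Fair", "Cloudy"]

encoder = LabelEncoder()
codes = encoder.fit_transform(conditions)

# classes_ lists the labels in sorted order; a label's index is its code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

print(codes.tolist())   # [0, 2, 1, 0]
print(mapping)          # {'Cloudy': 0, 'Fair': 1, 'Partly Cloudy': 2}

# inverse_transform recovers the original text from the codes.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because a refit replaces `classes_`, reusing one encoder for several columns keeps only the last column's mapping. Using a separate encoder per column is one way to keep every mapping available for decoding. The scikit-learn documentation describes `LabelEncoder` as intended for target labels and offers `OrdinalEncoder` for feature columns, which may be worth considering here.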

In [29]:
launch_data.head()

Unnamed: 0,Name,Date,Time (East Coast),Location,Crewed or Uncrewed,Launched?,High Temp,Low Temp,Ave Temp,Temp at Launch Time,...,Max Wind Speed,Visibility,Wind Speed at Launch Time,Hist Ave Max Wind Speed,Hist Ave Visibility,Sea Level Pressure,Hist Ave Sea Level Pressure,Day Length,Condition,Notes
0,0,1958-12-04,0,Cape Canaveral,1,N,75.0,68.0,71.0,0.0,...,16.0,15.0,0.0,0.0,0.0,30.22,0.0,10:26:00,0,0
1,0,1958-12-05,0,Cape Canaveral,1,N,78.0,70.0,73.39,0.0,...,14.0,10.0,0.0,0.0,0.0,30.2,0.0,10:26:00,0,0
2,Pioneer 3,1958-12-06,01:45:00,Cape Canaveral,1,Y,73.0,0.0,60.21,62.0,...,15.0,10.0,11.0,0.0,0.0,30.25,0.0,10:25:00,0,0
3,0,1958-12-07,0,Cape Canaveral,1,N,76.0,57.0,66.04,0.0,...,10.0,10.0,0.0,0.0,0.0,30.28,0.0,10:25:00,6,0
4,0,1958-12-08,0,Cape Canaveral,1,N,79.0,60.0,70.52,0.0,...,12.0,10.0,0.0,0.0,0.0,30.23,0.0,12:24:00,6,0
