# UFO Sightings Through the Data Science Pipeline 

By Melissa Chen and Andrea Soto

## Introduction

Most of us do not realize that data science plays a role in many aspects of our daily lives. For example, data science has applications in marketing and advertising, healthcare, government, image recognition, cybersecurity, and many other areas. Data science is a rapidly growing, interdisciplinary field that draws on scientific, mathematical, computational, and design techniques to extract insight from large amounts of data and solve real-world problems. More and more industries are recognizing the importance of this field for making smart, informed, and even profitable everyday decisions. Since there is an increasing need for knowledgeable and capable data scientists, it is important for more people to acquire the necessary skills so that they can confidently enter this exciting and growing field.

In this introductory tutorial we will choose a dataset and walk through the steps of the data science pipeline in a manner that is easy to understand and follow. The pipeline consists of the following steps:

1. Data Collection 
2. Data Processing 
3. Exploratory Analysis and Visualization 
4. Hypothesis Testing and Machine Learning 
5. Insight and Policy 

This tutorial can help those who understand scripting languages but are new to the data science methodology as a whole. It can also help more experienced data scientists who are perhaps unfamiliar with Python and the various libraries available that simplify the data analysis process.

By the end of the tutorial, you will have learned how to search for and derive meaningful information from a large dataset and convey your findings in a visually insightful way.

## Table of Contents:

1. [Lab Set-Up](#lab)
2. [Motivation](#mot)
3. [Data Collection](#coll)
4. [Data Processing](#proc)
5. [Exploratory Analysis and Visualization](#expl)
6. [Hypothesis Testing and Machine Learning](#hyp)
7. [Insights](#ins)
8. [Summary](#sum)

## Lab Set-Up
<a id='lab'></a>

First, you need to set up your machine. The easiest way to do this is to download [Anaconda](https://www.anaconda.com/download/#macos), a popular Python data science platform, with Python version 3.6. We recommend using [Jupyter Notebook](https://jupyter-notebook.readthedocs.io/en/stable/) to create documents where you can run your code and easily view data visualizations. This open-source application is especially easy for beginners to learn and is included with Anaconda. Simply run the command `jupyter notebook` in the macOS/Linux terminal or Windows Command Prompt to launch the Jupyter interface. Jupyter also supports [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet), which lets you document your code and make your notebooks more presentable.

Create a new directory for this tutorial. Within it, create a Jupyter notebook and download the [UFO Sightings](https://www.kaggle.com/NUFORC/ufo-sightings/) dataset. We will be using the following Python libraries:

1. [Pandas](https://pandas.pydata.org/pandas-docs/stable/)
2. [Numpy](https://docs.scipy.org/doc/numpy/)
3. [Matplotlib](https://matplotlib.org/)
4. [Seaborn](https://seaborn.pydata.org/)
5. [sklearn](http://scikit-learn.org/stable/)
6. [Folium](https://github.com/python-visualization/folium)

Folium is not included with Anaconda, so you will have to install it by running `pip install folium` in the terminal or command prompt. The following cell imports all of the necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
import sklearn.metrics as skl
import folium
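
If one of the imports above fails, it can help to see which libraries are missing before going further. The following is a minimal sketch that uses only the standard library; the helper name `check_libraries` and the `REQUIRED` list are our own illustration and are not part of any of the libraries listed.

```python
import importlib.util

# Import names of the libraries used in this tutorial.
# Note that scikit-learn is imported under the name "sklearn".
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn", "folium"]


def check_libraries(names):
    """Return a dict mapping each import name to True if Python can locate it."""
    # find_spec returns None when a top-level module cannot be found,
    # so nothing is actually imported and nothing raises for a missing library.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_libraries(REQUIRED).items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

Any library reported as missing can usually be installed from the terminal with `pip install`, in the same way as Folium.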

## Motivation 
<a id='mot'></a>


## Data Collection
<a id='coll'></a>
