# UFO Sightings

#### The objective of this assignment is for you to explain what is happening in each cell in clear, understandable language. 

#### _There is no need to code._ The code is there for you, and it already runs. Your task is only to explain what each line in each cell does.

#### The placeholder cells should describe what happens in the cell below it.

**Example**: The cell below imports `pandas` as a dependency because `pandas` functions will be used throughout the program, such as the Pandas `DataFrame` as well as the `read_csv` function.

In [1]:
import pandas as pd

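This cell stores the path to the CSV file that holds all the data to be processed. 
The file is then passed to the Pandas `read_csv` function, which loads it into a new dataframe (`ufo_df`) so that Pandas functions can be applied to it. Finally, `head()` displays the first five rows as a quick check.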

In [2]:
csv_path = "Resources/ufoSightings.csv"

ufo_df = pd.read_csv(csv_path)

ufo_df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611


The dataframe.count function lists the number of non-missing values in each column. This is a tool to understand the size of the dataframe and to detect missing values. In this case most columns have 80,332 entries, but state, country, shape, and comments have fewer, so those columns contain missing values.

In [3]:
ufo_df.count()

datetime                80332
city                    80332
state                   74535
country                 70662
shape                   78400
duration (seconds)      80332
duration (hours/min)    80332
comments                80317
date posted             80332
latitude                80332
longitude               80332
dtype: int64
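
To make the link between `count()` and missing values concrete, here is a minimal sketch on a made-up three-row table (the values are illustrative, not taken from the CSV). It assumes that `isna().sum()` is an acceptable way to count the gaps directly.

```python
import pandas as pd

# Toy data standing in for the UFO table (values are made up for illustration).
toy_df = pd.DataFrame({
    "city": ["edna", "kaneohe", "bristol"],
    "state": ["tx", None, "tn"],
    "country": ["us", None, None],
})

non_null = toy_df.count()       # non-missing values per column
missing = toy_df.isna().sum()   # missing values per column

print(non_null)
print(missing)
```

For every column, the two numbers add up to the number of rows, which is why a lower count in one column signals missing values there.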

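Because we detected missing values in the previous cell, we start with the dropna function to drop every row that has any missing value, which may be an extreme choice: the table shrinks from 80,332 rows to 66,516. Calling count() again confirms that every column now has the same number of values.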

In [8]:
clean_ufo_df = ufo_df.dropna(how="any")
clean_ufo_df.count()



datetime                66516
city                    66516
state                   66516
country                 66516
shape                   66516
duration (seconds)      66516
duration (hours/min)    66516
comments                66516
date posted             66516
latitude                66516
longitude               66516
dtype: int64
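
Since dropping on any missing value may be extreme, a less strict option is sketched below. It runs on made-up data and assumes that only the `state` column matters for the later analysis; that assumption is mine, not the notebook's.

```python
import pandas as pd

# Made-up rows: one is missing a state, another is missing a comment.
toy_df = pd.DataFrame({
    "city": ["edna", "kaneohe", "bristol", "norwalk"],
    "state": ["tx", "hi", None, "ct"],
    "comments": ["lights", None, "sphere", "disk"],
})

# how="any" drops a row if any column is missing (the notebook's choice).
strict = toy_df.dropna(how="any")

# subset= checks only the named columns, so a row missing just a comment survives.
lenient = toy_df.dropna(subset=["state"])

print(len(strict), len(lenient))
```

Which version is appropriate depends on which columns the later cells actually use.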

The first command defines a list of column names. It holds only names, not data: the columns in this list are kept in the new table, and those that are not listed (latitude and longitude) are dropped.

The second command uses `.loc` with two arguments. The first is a condition that keeps only the rows whose "country" value is "us"; the second is the list of columns, so only the listed columns are included in the new usa_ufo_df dataframe. Finally, head() displays the first five rows.

In [17]:
columns = [
    "datetime",
    "city",
    "state",
    "country",
    "shape",
    "duration (seconds)",
    "duration (hours/min)",
    "comments",
    "date posted"
]

usa_ufo_df = clean_ufo_df.loc[clean_ufo_df["country"] == "us", columns]
usa_ufo_df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004
5,10/10/1961 19:00,bristol,tn,us,sphere,300,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007
7,10/10/1965 23:45,norwalk,ct,us,disk,1200,20 minutes,A bright orange color changing to reddish colo...,10/2/1999
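
The row-and-column selection above can be sketched in two steps on a small made-up table. The names `is_us` and `keep` are hypothetical helpers introduced only for this illustration.

```python
import pandas as pd

# Made-up rows loosely modelled on the table above.
toy_df = pd.DataFrame({
    "city": ["san marcos", "chester (uk/england)", "kaneohe"],
    "country": ["us", "gb", "us"],
    "latitude": [29.88, 53.2, 21.42],
})

is_us = toy_df["country"] == "us"   # boolean Series: one True/False per row
keep = ["city", "country"]          # columns to keep; latitude is left out

us_df = toy_df.loc[is_us, keep]
print(us_df)
```

Note that the original row labels (0 and 2 here) are preserved, which is why the index in the notebook output skips numbers.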


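The value_counts() function counts how many times each distinct value appears in the column and sorts the result from most to least frequent. Here "ca" appears 8,683 times, so California has the most recorded sightings, more than twice as many as Florida (3,754). head() shows the five states with the most sightings.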

In [25]:
state_counts = usa_ufo_df["state"].value_counts()
state_counts.head()

ca    8683
fl    3754
wa    3707
tx    3398
ny    2915
Name: state, dtype: int64
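
A minimal sketch of the same idea on a made-up list of six state codes, to show that the result is sorted by frequency:

```python
import pandas as pd

# Six made-up sightings, identified only by state code.
states = pd.Series(["ca", "tx", "ca", "wa", "ca", "tx"], name="state")

counts = states.value_counts()   # one row per distinct value, most frequent first
print(counts)
print(counts.head(2))            # the two most frequent states
```

The counts always add up to the number of non-missing entries in the column.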

This cell converts the state_counts Series into a new dataframe named state_ufo_counts_df. The states become the index and the counts are stored in a single column.

In [32]:
state_ufo_counts_df = pd.DataFrame(state_counts)
state_ufo_counts_df

Unnamed: 0,state
ca,8683
fl,3754
wa,3707
tx,3398
ny,2915
il,2447
az,2362
pa,2319
oh,2251
mi,1781


This cell renames the "state" column to "Sum of Sightings" using the dataframe.rename() function, then displays the first five rows with head().

In [33]:
state_ufo_counts_df = state_ufo_counts_df.rename(
    columns={"state": "Sum of Sightings"})
state_ufo_counts_df.head()

Unnamed: 0,Sum of Sightings
ca,8683
fl,3754
wa,3707
tx,3398
ny,2915
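
One caveat, offered as a hedged aside: the label of the column produced from `value_counts()` depends on the Pandas version. In the output above it is `state`, but from Pandas 2.0 onward it is `count`, in which case renaming `"state"` would silently change nothing. A sketch that should not depend on the default label, using made-up data:

```python
import pandas as pd

# Made-up sightings, identified only by state code.
states = pd.Series(["ca", "tx", "ca", "wa", "ca", "tx"], name="state")
state_counts = states.value_counts()

# Name the Series first, then convert it; the column takes the Series name.
counts_df = state_counts.rename("Sum of Sightings").to_frame()
print(counts_df)
```

This combines the two cells above into one step; the two-step version in the notebook is equally valid when the column label is known.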


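This cell returns the data type of each column in the dataframe. Every column is stored as object, which usually means text, so even "duration (seconds)" is not yet numeric.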

In [34]:
usa_ufo_df.dtypes

datetime                object
city                    object
state                   object
country                 object
shape                   object
duration (seconds)      object
duration (hours/min)    object
comments                object
date posted             object
dtype: object

This cell changes the data type of the "duration (seconds)" column from string to float, then prints the data types again to confirm the change.

In [36]:
usa_ufo_df.loc[:, "duration (seconds)"] = usa_ufo_df["duration (seconds)"].astype("float")
usa_ufo_df.dtypes

datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)      float64
duration (hours/min)     object
comments                 object
date posted              object
dtype: object
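
`astype("float")` raises an error if any value cannot be parsed as a number. As a sketch of a more forgiving alternative, `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN instead. The malformed value below is invented for the illustration and is not claimed to be in the CSV.

```python
import pandas as pd

# Made-up string durations; the third entry is deliberately malformed.
raw = pd.Series(["2700", "20", "2`", "900"], name="duration (seconds)")

# errors="coerce" converts what it can and marks the rest as NaN.
seconds = pd.to_numeric(raw, errors="coerce")

print(seconds)
print(seconds.isna().sum())   # how many values failed to convert
```

Checking the number of NaN values afterwards shows how much data the conversion lost.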

This cell returns the sum of the "duration (seconds)" column, which is possible now that the values are numeric rather than strings.

In [37]:
# Now it is possible to find the sum of seconds
usa_ufo_df["duration (seconds)"].sum()

351281285.38
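
A total in seconds is hard to read, so here is a small sketch that converts the figure printed above into hours and days. The variable names are hypothetical and the conversion is plain arithmetic, not part of the original notebook.

```python
# Total reported duration, copied from the cell output above.
total_seconds = 351281285.38

total_hours = total_seconds / 3600   # 3600 seconds in an hour
total_days = total_hours / 24        # 24 hours in a day

print(f"{total_hours:,.0f} hours, or about {total_days:,.0f} days")
```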

This cell groups the dataframe by two keys, state and then city. It then counts the values in the "datetime" column within each group, which gives the number of sightings recorded for each city in each state.

In [38]:
grouped_data = usa_ufo_df.groupby(['state', 'city'])

# Hint: If you are counting records, you can use any column and get the same result. Try it.
grouped_data['datetime'].count()

state  city                                                     
ak     adak                                                          1
       anchor point                                                  1
       anchorage                                                    82
       angoon                                                        1
       auke bay                                                      2
       bethel                                                        8
       big lake                                                      1
       butte                                                         1
       chugiak                                                       2
       clam gulch                                                    1
       cold bay                                                      1
       cordova                                                       2
       council                                                       1
       craig