# Goal
This is a generic notebook designed to help analyze any dataset. By updating the data source and running the entire notebook, you should be able to derive basic insights about the data before creating a more targeted solution.

## Setup
- pip install jupyter
- pip install pandas
- pip install scikit-learn

## Upload Data

In [31]:
import pandas as pd

dataset = pd.read_csv('datasets/ufos/complete.csv', on_bad_lines='skip') # Update with your data source


## Explore Data
The following commands give a high-level overview of the dataset.

In [32]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88679 entries, 0 to 88678
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   datetime              88679 non-null  object 
 1   city                  88679 non-null  object 
 2   state                 81270 non-null  object 
 3   country               76314 non-null  object 
 4   shape                 85757 non-null  object 
 5   duration (seconds)    88677 non-null  object 
 6   duration (hours/min)  85660 non-null  object 
 7   comments              88644 non-null  object 
 8   date posted           88679 non-null  object 
 9   latitude              88679 non-null  object 
 10  longitude             88679 non-null  float64
dtypes: float64(1), object(10)
memory usage: 7.4+ MB


In [33]:
dataset.head()

| | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.8830556 | -97.941111 |
| 1 | 10/10/1949 21:00 | lackland afb | tx | NaN | light | 7200 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 |
| 2 | 10/10/1955 17:00 | chester (uk/england) | NaN | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.9783333 | -96.645833 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.4180556 | -157.803611 |


In [34]:
dataset.describe()

| | longitude |
|---|---|
| count | 88679.0 |
| mean | -85.021836 |
| std | 41.421744 |
| min | -176.658056 |
| 25% | -112.073333 |
| 50% | -87.65 |
| 75% | -77.769738 |
| max | 178.4419 |


In [35]:
corr_matrix = dataset.corr(numeric_only=True)
corr_matrix

| | longitude |
|---|---|
| longitude | 1.0 |
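
Both `describe()` and `corr(numeric_only=True)` report only `longitude` because it is the only column with a numeric dtype in the `dataset.info()` output; `latitude` and `duration (seconds)` were read as `object`. One common way to bring such columns into the summary is `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset), so treat it as a starting point rather than a required step.

```python
import pandas as pd

# Hypothetical stand-in for a column that pandas loaded as text
toy = pd.DataFrame({
    "latitude": ["29.8830556", "29.38421", "not a number", "53.2"],
    "longitude": [-97.941111, -98.581082, -2.916667, -96.645833],
})

# Values that cannot be parsed become NaN instead of raising an error
toy["latitude"] = pd.to_numeric(toy["latitude"], errors="coerce")

print(toy.dtypes)
print(toy.describe())
print(toy.corr(numeric_only=True))
```

On the real dataset the same call could be applied to `dataset["latitude"]` or `dataset["duration (seconds)"]`; it is worth counting the resulting NaN values afterwards to see how many entries were dropped by the coercion.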


### Check for Unique Values

In [36]:
def check_unique_values(dataset):
    for column in dataset.columns:
        # Print each column name with its number of unique values (NaN counts as one value)
        print(f'{column}: {len(dataset[column].unique())}')
        
check_unique_values(dataset)

datetime: 76159
city: 22018
state: 69
country: 6
shape: 30
duration (seconds): 733
duration (hours/min): 9792
comments: 88284
date posted: 317
latitude: 25428
longitude: 20549
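
The same counts can also be obtained without a loop through pandas' built-in `DataFrame.nunique`. The sketch below, on a made-up frame (`toy` is illustrative only), shows that `nunique(dropna=False)` matches `len(column.unique())`, while the default `nunique()` leaves missing values out of the count.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column
toy = pd.DataFrame({
    "shape": ["light", "circle", "light", np.nan],
    "country": ["us", np.nan, "gb", "us"],
})

# Loop-based count, as in check_unique_values: NaN is counted as a value
manual = {column: len(toy[column].unique()) for column in toy.columns}

# Built-in equivalents
with_missing = toy.nunique(dropna=False)  # counts NaN as a value
without_missing = toy.nunique()           # ignores NaN (the default)

print(manual)
print(with_missing)
print(without_missing)
```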


The following function flags likely categorical columns by comparing the number of unique values in each column to its total number of rows. If that ratio is less than 0.05, the column is considered categorical. This is a heuristic, so review the resulting list before relying on it.

In [37]:
def check_categorical_columns(dataset) -> list:
    categorical_columns = []
    for column in dataset.columns:
        if len(dataset[column].unique()) / len(dataset[column]) < 0.05:
            categorical_columns.append(column)
    return categorical_columns

categorical_columns = check_categorical_columns(dataset)
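
To make the threshold concrete, the check can be redone by hand using the unique-value counts printed earlier and the 88679 rows reported by `dataset.info()`. The sketch below hard-codes those numbers instead of reading the dataset; the names `unique_counts` and `flagged` are illustrative.

```python
# Unique-value counts copied from the check_unique_values output
unique_counts = {
    "datetime": 76159,
    "city": 22018,
    "state": 69,
    "country": 6,
    "shape": 30,
    "duration (seconds)": 733,
    "duration (hours/min)": 9792,
    "comments": 88284,
    "date posted": 317,
    "latitude": 25428,
    "longitude": 20549,
}
total_rows = 88679  # from dataset.info()
threshold = 0.05

# Keep the columns whose unique-to-total ratio falls below the threshold
flagged = [
    column
    for column, count in unique_counts.items()
    if count / total_rows < threshold
]

for column in flagged:
    print(f"{column}: {unique_counts[column] / total_rows:.5f}")
```

For instance, `duration (hours/min)` has a ratio of roughly 0.11 and is left out, while `duration (seconds)` at roughly 0.008 is flagged even though it holds a measurement and not a category.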

For each of the potential category columns, we will use the OrdinalEncoder from scikit-learn to convert the values to numbers.

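    from sklearn.preprocessing import OrdinalEncoder

    encoder = OrdinalEncoder()
    for column in categorical_columns:
        try:
            # Encode one column at a time; the encoder expects 2D input, hence reshape(-1, 1)
            dataset[column + "_cat"] = encoder.fit_transform(dataset[column].values.reshape(-1, 1)).ravel()
            # categories_ holds one array per fitted column, so index 0 is this column
            print(f"{column} ({len(encoder.categories_[0])} categories): {encoder.categories_[0]}")
        except Exception as e:
            # Columns with mixed value types may fail to encode; report and continue
            print(f'Could not encode {column}: {e}')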

In [57]:
def encode_categorical_columns(dataset: pd.DataFrame, categorical_columns: list):
    from sklearn.preprocessing import OrdinalEncoder
    encoder = OrdinalEncoder()
    
    # Convert all categorical columns to strings for consistency.
    # Note: this modifies dataset in place, and missing values become the category 'nan'.
    for column in categorical_columns:
        dataset[column] = dataset[column].astype(str)
        
    encoder.fit(dataset[categorical_columns])
    print(encoder.categories_)
        
encode_categorical_columns(dataset, categorical_columns)

[array(['ab', 'ak', 'al', 'ar', 'az', 'bc', 'ca', 'co', 'ct', 'dc', 'de',
       'fl', 'ga', 'hi', 'ia', 'id', 'il', 'in', 'ks', 'ky', 'la', 'ma',
       'mb', 'md', 'me', 'mi', 'mn', 'mo', 'ms', 'mt', 'nan', 'nb', 'nc',
       'nd', 'ne', 'nf', 'nh', 'nj', 'nm', 'ns', 'nt', 'nv', 'ny', 'oh',
       'ok', 'on', 'or', 'pa', 'pe', 'pq', 'pr', 'qc', 'ri', 'sa', 'sc',
       'sd', 'sk', 'tn', 'tx', 'ut', 'va', 'vi', 'vt', 'wa', 'wi', 'wv',
       'wy', 'yk', 'yt'], dtype=object), array(['au', 'ca', 'de', 'gb', 'nan', 'us'], dtype=object), array(['changed', 'changing', 'chevron', 'cigar', 'circle', 'cone',
       'crescent', 'cross', 'cylinder', 'delta', 'diamond', 'disk',
       'dome', 'egg', 'fireball', 'flare', 'flash', 'formation',
       'hexagon', 'light', 'nan', 'other', 'oval', 'pyramid', 'rectangle',
       'round', 'sphere', 'teardrop', 'triangle', 'unknown'], dtype=object), array(['0', '0.0', '0.001', '0.01', '0.02', '0.05', '0.08', '0.1', '0.2',
       '0.23', '0.3', '0.33', '0

In [58]:
dataset.head()

| | datetime | city | state | country | shape | duration (seconds) | duration (hours/min) | comments | date posted | latitude | longitude | state_cat | country_cat | shape_cat | date posted_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10/10/1949 20:30 | san marcos | tx | us | cylinder | 2700 | 45 minutes | This event took place in early fall around 194... | 4/27/2004 | 29.8830556 | -97.941111 | 57.0 | 4.0 | 8.0 | 174.0 |
| 1 | 10/10/1949 21:00 | lackland afb | tx | NaN | light | 7200 | 1-2 hrs | 1949 Lackland AFB&#44 TX. Lights racing acros... | 12/16/2005 | 29.38421 | -98.581082 | 57.0 | NaN | 19.0 | 79.0 |
| 2 | 10/10/1955 17:00 | chester (uk/england) | NaN | gb | circle | 20 | 20 seconds | Green/Orange circular disc over Chester&#44 En... | 1/21/2008 | 53.2 | -2.916667 | NaN | 3.0 | 4.0 | 10.0 |
| 3 | 10/10/1956 21:00 | edna | tx | us | circle | 20 | 1/2 hour | My older brother and twin sister were leaving ... | 1/17/2004 | 28.9783333 | -96.645833 | 57.0 | 4.0 | 4.0 | 7.0 |
| 4 | 10/10/1960 20:00 | kaneohe | hi | us | light | 900 | 15 minutes | AS a Marine 1st Lt. flying an FJ4B fighter/att... | 1/22/2004 | 21.4180556 | -157.803611 | 13.0 | 4.0 | 19.0 | 12.0 |
