# 1. Import Libraries

In [24]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 2. Load The Dataset

In [25]:
# The file contains bytes that are not valid UTF-8, so the encoding must be specified explicitly
df = pd.read_csv('data/banklist.csv', encoding='latin-1')
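
To see why the encoding matters, the sketch below decodes a single header field that contains the byte 0xA0. The byte string is hypothetical and made up for illustration; it is not read from `banklist.csv`.

```python
# Hypothetical header bytes: "Bank Name" followed by the single byte 0xA0.
raw = b"Bank Name\xa0"

# Latin-1 maps every byte to exactly one character, so decoding always succeeds.
# Byte 0xA0 becomes U+00A0, the non-breaking space.
decoded = raw.decode("latin-1")
print(repr(decoded))

# A lone 0xA0 byte is not a valid UTF-8 sequence, so strict UTF-8 decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

This would also explain why a non-breaking space can survive at the end of a column name after loading with `encoding='latin-1'`.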

# 3. Understand The Data

### Read in basic information about the dataset

In [26]:
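# df.info() prints its summary directly and returns None, so wrapping it in print() also prints "None"
print(df.info())
# Count the missing values in each column
print(df.isna().sum())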

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Bank Name               568 non-null    object
 1   City                    568 non-null    object
 2   State                   568 non-null    object
 3   Cert                    568 non-null    int64 
 4   Acquiring Institution   568 non-null    object
 5   Closing Date            568 non-null    object
 6   Fund                    568 non-null    int64 
dtypes: int64(2), object(5)
memory usage: 31.2+ KB
None
Bank Name                 0
City                      0
State                     0
Cert                      0
Acquiring Institution     0
Closing Date              0
Fund                      0
dtype: int64


### Check for duplicates

In [27]:
# Check for duplicates
dupes = df.duplicated()
print(dupes.sum())

0
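
By default, `duplicated()` flags a row only when every column matches an earlier row. The sketch below, on a toy frame invented for illustration, shows how the `subset` parameter narrows the comparison to a key column. Using `Cert` as that key is an assumption about the data, not something the dataset confirms.

```python
import pandas as pd

# Toy data, invented for illustration; it is not taken from banklist.csv.
toy = pd.DataFrame({
    "Name": ["Bank A", "Bank A", "Bank B"],
    "City": ["Springfield", "Shelbyville", "Springfield"],
    "Cert": [100, 100, 200],
})

# Default: a row counts as a duplicate only if every column matches an earlier row.
full_row_dupes = toy.duplicated().sum()

# subset= restricts the comparison to the listed columns.
cert_dupes = toy.duplicated(subset=["Cert"]).sum()

print(full_row_dupes, cert_dupes)
```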


### Understand the unique values for each column

In [31]:
# Understand the unique values in each column
for col in df.columns:
    print(col)
    print(df[col].unique())

['Citizens Bank' 'Heartland Tri-State Bank' 'First Republic Bank'
 'Signature Bank' 'Silicon Valley Bank' 'Almena State Bank'
 'First City Bank of Florida' 'The First State Bank' 'Ericson State Bank'
 'City National Bank of New Jersey' 'Resolute Bank'
 'Louisa Community Bank' 'The Enloe State Bank'
 'Washington Federal Bank for Savings'
 'The Farmers and Merchants State Bank of Argonia' 'Fayette County Bank'
 'Guaranty Bank, (d/b/a BestBank in Georgia & Michigan)' 'First NBC Bank'
 'Proficio Bank' 'Seaway Bank and Trust Company' 'Harvest Community Bank'
 'Allied Bank' 'The Woodbury Banking Company' 'First CornerStone Bank'
 'Trust Company Bank' 'North Milwaukee State Bank'
 'Hometown National Bank' 'The Bank of Georgia' 'Premier Bank'
 'Edgebrook Bank' 'Doral Bank' 'Capitol City Bank & Trust Company'
 'Highland Community Bank' 'First National Bank of Crestview'
 'Northern Star Bank' 'Frontier Bank, FSB D/B/A El Paseo Bank'
 'The National Republic Bank of Chicago' 'NBRS Financial'
 'Gre

#### For Acquiring Institution, some names contain "N.A." This does not indicate a null value; it stands for "National Association".
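
One way to confirm this distinction is to count the names containing the literal text "N.A." separately from the truly missing values. The institution names below are hypothetical and invented for illustration.

```python
import pandas as pd

# Hypothetical institution names, invented for illustration.
acquirers = pd.Series([
    "Example National Bank, N.A.",
    "Sample Savings Bank",
    "Demo Bank, N.A.",
])

# regex=False treats "N.A." as a literal substring;
# as a regular expression, each "." would match any character.
has_na_suffix = acquirers.str.contains("N.A.", regex=False)

print(has_na_suffix.sum())     # names containing the literal text "N.A."
print(acquirers.isna().sum())  # genuinely missing values
```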

## 4. Clean the data

### Remove the \xa0 character from the column names

In [None]:
# Remove the \xa0 character from the column names (and shorten "Bank Name" to "Name")
name_dict = {
    "Bank Name\xa0": "Name",
    "City\xa0": "City",
    "State\xa0": "State",
    "Cert\xa0": "Cert",
    "Acquiring Institution\xa0": "Acquiring Institution",
    "Closing Date\xa0": "Closing Date",
    "Fund": "Fund",  # already clean; listed for completeness
}

df = df.rename(columns=name_dict)
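
An alternative to listing every column by hand is to strip the character from all labels at once. This is a minimal sketch on a toy frame whose headers mimic the ones above; unlike the dictionary approach, it keeps "Bank Name" instead of shortening it to "Name".

```python
import pandas as pd

# Toy frame whose headers mimic the trailing \xa0 seen in this dataset.
toy = pd.DataFrame(columns=["Bank Name\xa0", "City\xa0", "Fund"])

# Remove every non-breaking space from the header labels.
toy.columns = toy.columns.str.replace("\xa0", "", regex=False)

print(list(toy.columns))
```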

#### No further data cleaning is needed. The dataset has no null values or duplicated rows, so no rows or columns need to be dropped.

## 5. Exploratory Data Analysis


### Univariate

When performing EDA on individual columns of data, thinking through the following acronym can give you a better understanding of the data (CSOCS):

- Context: what data are we looking at?
- Shape: how is your data distributed (boxplots, histograms)?
- Outliers
- Center: mean, median, mode?
- Spread: range, IQR, standard deviation
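
The center, spread, and outlier steps of the acronym can be sketched on a single numeric column as follows. The values are hypothetical, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one. Shape is usually inspected with a histogram or boxplot and is not shown here.

```python
import pandas as pd

# Hypothetical numeric column, invented for illustration.
values = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 40])

# Center
mean = values.mean()
median = values.median()
mode = values.mode().tolist()

# Spread
value_range = values.max() - values.min()
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
std = values.std()

# Outliers: points more than 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(median, mode, value_range, iqr)
print(outliers.tolist())
```

Comparing the mean with the median is a quick check on skew: a single extreme value pulls the mean upward while the median stays put.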


### Multivariate

If necessary for your data, you can perform some analysis across multiple columns of data. We took a look at the use of scatterplots for this, but you are free to analyze relationships between your columns (features) in any meaningful way. Document what you’re doing as you go.
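
When both columns are categorical, a scatterplot is less informative than a table of counts. The sketch below cross-tabulates two columns of a toy frame. The data and the `Closing Year` column are hypothetical; the real dataset would first need a year derived from its closing dates.

```python
import pandas as pd

# Toy data, invented for illustration; "Closing Year" is a hypothetical column.
toy = pd.DataFrame({
    "State": ["GA", "GA", "FL", "IL", "FL", "GA"],
    "Closing Year": [2009, 2010, 2009, 2010, 2009, 2009],
})

# Rows are states, columns are years, cells are the number of matching rows.
counts = pd.crosstab(toy["State"], toy["Closing Year"])
print(counts)
```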



## Statistical Analysis

With your understanding of the dataset, formulate and answer at least 3 statistical questions. Examples of statistical questions from the datasets we’ve worked with in class include:

- NY House Dataset:
  - Which sublocality has the most expensive cost per square foot?
  - What is the average ratio of bedrooms to bathrooms in Queens?
  - What is the most common type of property on sale?
- Rollercoaster Dataset:
  - What are the average speeds of the different types of rollercoasters?
  - Which manufacturer has been most successful in maintaining open coasters?
  - Where are the fastest coasters located?

In your Jupyter Notebook, create a new section for each question. You should document your question, provide the code for any number value or graph, and document the conclusions you came to.
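
As a rough sketch of what one such section might contain, the example below answers a hypothetical question, "Which state has the most bank closures?", on toy data invented for illustration. The question and the values are assumptions and are not results from this dataset.

```python
import pandas as pd

# Toy data, invented for illustration; one row per closed bank.
toy = pd.DataFrame({"State": ["GA", "FL", "GA", "IL", "GA", "FL"]})

# Count the rows per state; value_counts sorts the counts in descending order.
closures_by_state = toy["State"].value_counts()

# idxmax returns the label (state) with the highest count.
top_state = closures_by_state.idxmax()
top_count = closures_by_state.max()

print(top_state, top_count)
```

The conclusion written under the code would then state the answer in words and note any caveats, such as ties between states.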