# Instructor Turn Activity 1: Loc and iLoc

## Explanation: 
## DataFrame.iloc
#### Purely integer-location based indexing for selection by position.
#### .iloc[ ] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

## DataFrame.loc
#### Purely label-location based indexer for selection by label.

#### .loc[ ] is primarily label based, but may also be used with a boolean array.
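
As a minimal sketch of the difference between the two indexers, the following uses a small made-up DataFrame (the names and values are illustrative and do not come from `sampleData.csv`). It also shows one practical wrinkle: `.loc[]` takes a boolean Series directly, while `.iloc[]` generally expects a plain boolean array, so the mask is converted with `.to_numpy()`.

```python
import pandas as pd

# Hypothetical data for illustration only
people = pd.DataFrame(
    {"first_name": ["Ann", "Bob", "Cy"], "age": [31, 45, 27]},
    index=["Lee", "Kim", "Roy"],
)

# Label-based: row labelled "Kim", column labelled "age"
by_label = people.loc["Kim", "age"]

# Position-based: second row (position 1), second column (position 1)
by_position = people.iloc[1, 1]

# Boolean mask: .loc takes the Series directly,
# .iloc is given a plain array of booleans
mask = people["age"] > 30
older_loc = people.loc[mask]
older_iloc = people.iloc[mask.to_numpy()]
```

Both lookups return the same cell, and both masked selections return the same rows; only the way the rows and columns are addressed differs.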

In [1]:
import pandas as pd

In [2]:
file = "Resources/sampleData.csv"

In [3]:
df_original = pd.read_csv(file)
df_original.head()

Unnamed: 0,id,first_name,last_name,Phone Number,Time zone
0,1,Peter,Richardson,7-(789)867-9023,Europe/Moscow
1,2,Janice,Berry,86-(614)973-1727,Asia/Harbin
2,3,Andrea,Hudson,86-(918)527-6371,Asia/Shanghai
3,4,Arthur,Mcdonald,420-(553)779-7783,Europe/Prague
4,5,Kathy,Morales,351-(720)541-2124,Europe/Lisbon


In [4]:
# Set new index to last_name
df = df_original.set_index("last_name")
df.head()

last_name,id,first_name,Phone Number,Time zone
Richardson,1,Peter,7-(789)867-9023,Europe/Moscow
Berry,2,Janice,86-(614)973-1727,Asia/Harbin
Hudson,3,Andrea,86-(918)527-6371,Asia/Shanghai
Mcdonald,4,Arthur,420-(553)779-7783,Europe/Prague
Morales,5,Kathy,351-(720)541-2124,Europe/Lisbon


In [5]:
# Grab the data contained within the "Berry" row and the "Phone Number" column
berry_phone = df.loc["Berry", "Phone Number"]
print("Using Loc: " + berry_phone)

also_berry_phone = df.iloc[1, 2]
print("Using Iloc: " + also_berry_phone)

Using Loc: 86-(614)973-1727
Using Iloc: 86-(614)973-1727


In [6]:
# Grab the rows for the first five last names and the columns from "id" to "Phone Number"
# The problem with using "last_name" as the index is that the values are not unique, so duplicates are returned
# With duplicate labels, loc[] does not raise an error; it returns every matching row
richardson_to_morales = df.loc[["Richardson", "Berry", "Hudson",
                                "Mcdonald", "Morales"], ["id", "first_name", "Phone Number"]]
print(richardson_to_morales)

print()

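# Using iloc[] will not return duplicates since each integer position refers to exactly one row
# Note that the slice end is exclusive: 0:4 selects four rows (positions 0 to 3); use 0:5 for five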
also_richardson_to_morales = df.iloc[0:4, 0:3]
print(also_richardson_to_morales)

            id first_name       Phone Number
last_name                                   
Richardson   1      Peter    7-(789)867-9023
Richardson  25     Donald   62-(259)282-5871
Berry        2     Janice   86-(614)973-1727
Hudson       3     Andrea   86-(918)527-6371
Hudson       8    Frances   57-(752)864-4744
Hudson      90      Norma  351-(551)598-1822
Mcdonald     4     Arthur  420-(553)779-7783
Morales      5      Kathy  351-(720)541-2124

            id first_name       Phone Number
last_name                                   
Richardson   1      Peter    7-(789)867-9023
Berry        2     Janice   86-(614)973-1727
Hudson       3     Andrea   86-(918)527-6371
Mcdonald     4     Arthur  420-(553)779-7783
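
One way to check in advance whether a label lookup may return extra rows is the index's `is_unique` attribute. The sketch below uses a made-up DataFrame (not the course data) in which "Hudson" appears twice, and it also illustrates that an `.iloc[]` slice stops before its end position.

```python
import pandas as pd

# Hypothetical data: "Hudson" appears twice in the index
staff = pd.DataFrame(
    {"id": [1, 2, 3, 8], "first_name": ["Peter", "Janice", "Andrea", "Frances"]},
    index=pd.Index(["Richardson", "Berry", "Hudson", "Hudson"], name="last_name"),
)

# False here, so a label lookup can match more than one row
labels_unique = staff.index.is_unique

# .loc returns every row whose label matches
hudsons = staff.loc[["Hudson"], ["id", "first_name"]]

# .iloc slices exclude the end position: 0:3 gives positions 0, 1, 2
first_three = staff.iloc[0:3, 0:2]
```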


In [7]:
# The following will select all rows for columns `first_name` and `Phone Number`
df.loc[:, ["first_name", "Phone Number"]].head()

last_name,first_name,Phone Number
Richardson,Peter,7-(789)867-9023
Berry,Janice,86-(614)973-1727
Hudson,Andrea,86-(918)527-6371
Mcdonald,Arthur,420-(553)779-7783
Morales,Kathy,351-(720)541-2124


In [8]:
# the following logic test/conditional statement returns a series of boolean values
named_billy = df["first_name"] == "Billy"
named_billy.head()

last_name
Richardson    False
Berry         False
Hudson        False
Mcdonald      False
Morales       False
Name: first_name, dtype: bool

In [9]:
# Loc also allows conditional statements to filter rows of data (iloc needs a plain boolean array, not a Series)
# Using loc on the logic test above returns only the rows where the result is True
only_billys = df.loc[df["first_name"] == "Billy", :]
print(only_billys)

print()

# Multiple conditions can be set to narrow down or widen the filter
only_billy_and_peter = df.loc[(df["first_name"] == "Billy") | (
    df["first_name"] == "Peter"), :]
print(only_billy_and_peter)

           id first_name      Phone Number       Time zone
last_name                                                 
Clark      20      Billy  62-(213)345-2549   Asia/Makassar
Andrews    23      Billy  86-(859)746-5367  Asia/Chongqing
Price      59      Billy  86-(878)547-7739   Asia/Shanghai

            id first_name      Phone Number       Time zone
last_name                                                  
Richardson   1      Peter   7-(789)867-9023   Europe/Moscow
Clark       20      Billy  62-(213)345-2549   Asia/Makassar
Andrews     23      Billy  86-(859)746-5367  Asia/Chongqing
Price       59      Billy  86-(878)547-7739   Asia/Shanghai
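
The filters above can be combined in a few ways. The following sketch, on made-up data, shows `|` (or), `&` (and), and `~` (not), along with `isin()` as a shorter way to test one column against several values. Each individual test needs its own parentheses because `|` and `&` bind more tightly than `==`.

```python
import pandas as pd

# Hypothetical data for illustration only
people = pd.DataFrame(
    {"first_name": ["Peter", "Billy", "Janice", "Billy"], "id": [1, 20, 2, 23]}
)

# OR: either condition may hold (parentheses are required around each test)
billy_or_peter = people.loc[
    (people["first_name"] == "Billy") | (people["first_name"] == "Peter"), :
]

# isin() expresses the same "one of these values" test more compactly
same_result = people.loc[people["first_name"].isin(["Billy", "Peter"]), :]

# AND narrows the filter; ~ negates a condition
billy_high_id = people.loc[(people["first_name"] == "Billy") & (people["id"] > 20), :]
not_billy = people.loc[~(people["first_name"] == "Billy"), :]
```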


## Students Turn Activity 2: Good Movies

### Instructions

  * Use Pandas to load and display the CSV provided in `Resources`.

  * List all the columns in the data set.

  * We're only interested in IMDb data, so create a new table that takes the Film and all the columns relating to IMDb.

  * Keep only the good movies (i.e., any film with an IMDb score greater than or equal to 7) and remove the norm ratings.

  * Find less popular movies that you may not have heard about, i.e., anything with under 20K votes.

  * Finally, export this file to a spreadsheet, excluding the index, so we can keep track of our future watchlist.

In [10]:
# Dependencies
import pandas as pd

In [11]:
# Load in File
movie_file = "Resources/movie_scores.csv"

In [14]:
# Read and display with pandas
movie_original = pd.read_csv(movie_file)
movie_original.head()

Unnamed: 0,FILM,RottenTomatoes,RottenTomatoes_User,Metacritic,Metacritic_User,IMDB,Fandango_Stars,Fandango_Ratingvalue,RT_norm,RT_user_norm,...,IMDB_norm,RT_norm_round,RT_user_norm_round,Metacritic_norm_round,Metacritic_user_norm_round,IMDB_norm_round,Metacritic_user_vote_count,IMDB_user_vote_count,Fandango_votes,Fandango_Difference
0,Avengers: Age of Ultron (2015),74,86,66,7.1,7.8,5.0,4.5,3.7,4.3,...,3.9,3.5,4.5,3.5,3.5,4.0,1330,271107,14846,0.5
1,Cinderella (2015),85,80,67,7.5,7.1,5.0,4.5,4.25,4.0,...,3.55,4.5,4.0,3.5,4.0,3.5,249,65709,12640,0.5
2,Ant-Man (2015),80,90,64,8.1,7.8,5.0,4.5,4.0,4.5,...,3.9,4.0,4.5,3.0,4.0,4.0,627,103660,12055,0.5
3,Do You Believe? (2015),18,84,22,4.7,5.4,5.0,4.5,0.9,4.2,...,2.7,1.0,4.0,1.0,2.5,2.5,31,3136,1793,0.5
4,Hot Tub Time Machine 2 (2015),14,28,29,3.4,5.1,3.5,3.0,0.7,1.4,...,2.55,0.5,1.5,1.5,1.5,2.5,88,19560,1021,0.5


In [15]:
# List all the columns the table provides
movie_original.columns

Index(['FILM', 'RottenTomatoes', 'RottenTomatoes_User', 'Metacritic',
       'Metacritic_User', 'IMDB', 'Fandango_Stars', 'Fandango_Ratingvalue',
       'RT_norm', 'RT_user_norm', 'Metacritic_norm', 'Metacritic_user_nom',
       'IMDB_norm', 'RT_norm_round', 'RT_user_norm_round',
       'Metacritic_norm_round', 'Metacritic_user_norm_round',
       'IMDB_norm_round', 'Metacritic_user_vote_count', 'IMDB_user_vote_count',
       'Fandango_votes', 'Fandango_Difference'],
      dtype='object')

In [40]:
# only care about Imdb, so create a new table that takes the Film and all the columns relating to IMDB
movie_short = movie_original[["FILM", "IMDB", "IMDB_norm", "IMDB_norm_round", "IMDB_user_vote_count"]]
movie_short.head()

Unnamed: 0,FILM,IMDB,IMDB_norm,IMDB_norm_round,IMDB_user_vote_count
0,Avengers: Age of Ultron (2015),7.8,3.9,4.0,271107
1,Cinderella (2015),7.1,3.55,3.5,65709
2,Ant-Man (2015),7.8,3.9,4.0,103660
3,Do You Believe? (2015),5.4,2.7,2.5,3136
4,Hot Tub Time Machine 2 (2015),5.1,2.55,2.5,19560


In [47]:
# List only movies whose IMDb rating is over 7 (out of 10)
# Note: the instructions ask for 7 or higher; use >= 7 to include films rated exactly 7
# Build a conditional statement, then pass it back into the DataFrame
# to retrieve only the rows that satisfy the condition
good_movie_easy = movie_short[movie_short["IMDB"] > 7]
good_movie_easy.head()

Unnamed: 0,FILM,IMDB,IMDB_norm,IMDB_norm_round,IMDB_user_vote_count
0,Avengers: Age of Ultron (2015),7.8,3.9,4.0,271107
1,Cinderella (2015),7.1,3.55,3.5,65709
2,Ant-Man (2015),7.8,3.9,4.0,103660
5,The Water Diviner (2015),7.2,3.6,3.5,39373
8,Shaun the Sheep Movie (2015),7.4,3.7,3.5,12227


In [48]:
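# Same filter with loc, keeping only the FILM, IMDB, and vote-count columns (this drops the norm ratings)
good_movies = movie_short.loc[movie_short["IMDB"] > 7, ["FILM", "IMDB", "IMDB_user_vote_count"]]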
good_movies.head()

Unnamed: 0,FILM,IMDB,IMDB_user_vote_count
0,Avengers: Age of Ultron (2015),7.8,271107
1,Cinderella (2015),7.1,65709
2,Ant-Man (2015),7.8,103660
5,The Water Diviner (2015),7.2,39373
8,Shaun the Sheep Movie (2015),7.4,12227


In [49]:
# Find lesser-known good movies to watch, with fewer than 20K votes
# Filter good_movies (not movie_short) so the watchlist keeps only the highly rated films
unknown_movies = good_movies[good_movies["IMDB_user_vote_count"] < 20000]

In [50]:
# Finally, export this file to an Excel spreadsheet -- without the DataFrame index.
unknown_movies.to_excel("output/movieWatchlist.xlsx", index=False)
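
Two practical points about the export step: pandas does not create a missing output folder for you, and `to_excel()` typically needs an Excel engine such as `openpyxl` installed. The sketch below, on a made-up watchlist, creates the folder first and writes a CSV instead so that it runs without extra packages; the temporary directory and the file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical watchlist for illustration only
watchlist = pd.DataFrame(
    {"FILM": ["Film A", "Film B"], "IMDB": [7.4, 7.9], "IMDB_user_vote_count": [12227, 8000]}
)

# Create the output folder first; a temporary directory stands in for "output/"
output_dir = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(output_dir, exist_ok=True)

# index=False leaves the 0, 1, 2, ... row labels out of the file
csv_path = os.path.join(output_dir, "movieWatchlist.csv")
watchlist.to_csv(csv_path, index=False)

# Reading the file back shows that only the three data columns were written
round_trip = pd.read_csv(csv_path)
```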

## Instructor Turn Activity 3: Cleaning Data


In [51]:
# Dependencies
import pandas as pd
import numpy as np

In [52]:
# Name of the CSV file
file = 'Resources/donors2008_unclean.csv'

In [53]:
# The correct encoding must be used to read the CSV in pandas
df = pd.read_csv(file, encoding="ISO-8859-1")

In [54]:
# Preview of the DataFrame
# Note that FIELD8 is likely a meaningless column
df.head()

Unnamed: 0,LastName,FirstName,Employer,City,State,Zip,Amount,FIELD8
0,Aaron,Eugene,State Department,Dulles,VA,20189,500.0,
1,Abadi,Barbara,Abadi & Co.,New York,NY,10021,200.0,
2,Adamany,Anthony,Retired,Rockford,IL,61103,500.0,
3,Adams,Lorraine,Self,New York,NY,10026,200.0,
4,Adams,Marion,,Exeter,NH,3833,100.0,


In [55]:
# Delete extraneous column
del df['FIELD8']
df.head()

Unnamed: 0,LastName,FirstName,Employer,City,State,Zip,Amount
0,Aaron,Eugene,State Department,Dulles,VA,20189,500.0
1,Abadi,Barbara,Abadi & Co.,New York,NY,10021,200.0
2,Adamany,Anthony,Retired,Rockford,IL,61103,500.0
3,Adams,Lorraine,Self,New York,NY,10026,200.0
4,Adams,Marion,,Exeter,NH,3833,100.0


In [56]:
# Identify incomplete rows
df.count()

LastName     1776
FirstName    1776
Employer     1743
City         1776
State        1776
Zip          1776
Amount       1776
dtype: int64

In [57]:
# Drop all rows with missing information
df = df.dropna(how='any')
df.head()

Unnamed: 0,LastName,FirstName,Employer,City,State,Zip,Amount
0,Aaron,Eugene,State Department,Dulles,VA,20189,500.0
1,Abadi,Barbara,Abadi & Co.,New York,NY,10021,200.0
2,Adamany,Anthony,Retired,Rockford,IL,61103,500.0
3,Adams,Lorraine,Self,New York,NY,10026,200.0
4,Adams,Marion,,Exeter,NH,3833,100.0


In [58]:
# Verify dropped rows
df.count()

LastName     1743
FirstName    1743
Employer     1743
City         1743
State        1743
Zip          1743
Amount       1743
dtype: int64
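
`count()` reports non-missing values per column; the complementary view is `isna().sum()`, which reports the missing ones directly. The sketch below uses made-up donor rows and also shows `dropna(subset=...)`, which drops a row only when specific columns are missing, as a gentler alternative to `how='any'`.

```python
import numpy as np
import pandas as pd

# Hypothetical donor rows for illustration only
donors = pd.DataFrame(
    {
        "LastName": ["Aaron", "Abadi", "Adams"],
        "Employer": ["State Department", np.nan, "Self"],
        "Amount": [500.0, 200.0, np.nan],
    }
)

# Missing values per column (the complement of count())
missing_per_column = donors.isna().sum()

# Drop a row if any column is missing
complete_rows = donors.dropna(how="any")

# Drop a row only if "Employer" is missing; other gaps are kept
has_employer = donors.dropna(subset=["Employer"])
```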

In [59]:
# Check the data type of each column. The Amount column should be numeric.
df.dtypes

LastName      object
FirstName     object
Employer      object
City          object
State         object
Zip           object
Amount       float64
dtype: object

In [60]:
# Use the pd.to_numeric() function to make sure the Amount column is numeric
df['Amount'] = pd.to_numeric(df['Amount'])

In [61]:
# Verify that the Amount column datatype has been made numeric
df['Amount'].dtype

dtype('float64')
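
When a column really does arrive as text, `pd.to_numeric()` raises an error on the first value it cannot parse unless told otherwise. The sketch below uses a made-up Series to show `errors="coerce"`, which converts unparseable entries to `NaN` so they can be counted and inspected.

```python
import pandas as pd

# Hypothetical amounts read in as text, with one entry that is not a number
raw_amounts = pd.Series(["500", "200.5", "n/a"])

# errors="coerce" turns unparseable entries into NaN instead of raising
amounts = pd.to_numeric(raw_amounts, errors="coerce")

# Count how many entries could not be converted
unparseable = int(amounts.isna().sum())
```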

In [62]:
# Display an overview of the Employer column
df['Employer'].value_counts()

None                                   249
Self                                   241
Retired                                126
Self Employed                           39
Self-Employed                           34
Google                                   6
Not Employed                             4
Unemployed                               4
Bank Of America                          3
Social Security Administration           3
University of California                 3
Kaiser Permanente                        2
Jones Day                                2
CSC                                      2
Rainey Cluss LLC                         2
Covington & Burling                      2
University Hospital                      2
Sidley Austin LLP                        2
Henry Crown & Company                    2
Davis Polk & Wardwell                    2
Skadden, Arps                            2
Northern Trust                           2
Freelance                                2
Mayer Brown

In [63]:
# Clean up Employer category. Replace 'Self Employed' and 'Self' with 'Self-Employed'
df['Employer'] = df['Employer'].replace(
    {'Self Employed': 'Self-Employed', 'Self': 'Self-Employed'})

In [64]:
# Verify clean-up.
df['Employer'].value_counts()

Self-Employed                          314
None                                   249
Retired                                126
Google                                   6
Not Employed                             4
Unemployed                               4
Social Security Administration           3
Bank Of America                          3
University of California                 3
Ariel Investments                        2
Hugo Neu Corporation                     2
Mayer Brown                              2
Microsoft                                2
Freelance                                2
University Hospital                      2
Sidley Austin LLP                        2
Covington & Burling                      2
Rainey Cluss LLC                         2
Mayer Brown LLP                          2
United Health Group                      2
Skadden, Arps                            2
Northern Trust                           2
Kaiser Permanente                        2
Jones Day  

In [65]:
# Treat 'Not Employed' as 'Unemployed'
df['Employer'] = df['Employer'].replace({'Not Employed': 'Unemployed'})
df['Employer'].value_counts()

Self-Employed                          314
None                                   249
Retired                                126
Unemployed                               8
Google                                   6
Bank Of America                          3
Social Security Administration           3
University of California                 3
Sidley Austin LLP                        2
Rainey Cluss LLC                         2
Ariel Investments                        2
Hugo Neu Corporation                     2
Mayer Brown                              2
Microsoft                                2
Freelance                                2
University Hospital                      2
Harvard University                       2
CSC                                      2
Covington & Burling                      2
Mayer Brown LLP                          2
United Health Group                      2
Henry Crown & Company                    2
Northern Trust                           2
Kaiser Perm
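
The two replacement steps above can also be collected into a single mapping. The sketch below, on made-up employer values, strips stray whitespace first (an assumption about what messy data might contain, not something observed in the donor file) and then applies one dictionary of old-to-canonical labels.

```python
import pandas as pd

# Hypothetical employer values for illustration only
employers = pd.Series(["Self", " Self Employed", "Self-Employed", "Not Employed", "Google"])

# Strip stray whitespace first so near-identical labels compare equal
employers = employers.str.strip()

# One mapping holds every old label -> canonical label pair
canonical = {
    "Self": "Self-Employed",
    "Self Employed": "Self-Employed",
    "Not Employed": "Unemployed",
}
cleaned = employers.replace(canonical)
counts = cleaned.value_counts()
```

Keeping the mapping in one dictionary makes it easier to review which labels were merged.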

In [66]:
# Display a statistical overview
# We can infer the maximum allowable individual contribution from 'max'
df.describe()

Unnamed: 0,Amount
count,1743.0
mean,640.12475
std,1242.343265
min,5.0
25%,200.0
50%,250.0
75%,500.0
max,5000.0


## Students Work in Pairs Turn: Activity 4 Portland Crime

### Instructions

  * Read in the csv using Pandas and print out the DataFrame that is returned

  * Get a count of rows within the DataFrame in order to determine if there are any null values

  * Drop the rows which contain null values

  * Search through the "Offense Type" column and replace any similar values with one consistent value

  * Create a couple DataFrames that look into one Neighborhood only and print them to the screen

In [67]:
# Import Dependencies
import pandas as pd

In [68]:
# Reference the file where the CSV is located
crime_csv_path = "Resources/crime_incident_data2017.csv"
# Import the data into a Pandas DataFrame
crime_df = pd.read_csv(crime_csv_path)
crime_df.head()

Unnamed: 0,Address,Case Number,Crime Against,Neighborhood,Number of Records,Occur Date,Occur Month Year,Occur Time,Offense Category,Offense Count,Offense Type,Open Data Lat,Open Data Lon,Open Data X,Open Data Y,Report Date,Report Month Year
0,,17-X4762181,Person,,1,1/1/96,1/1/96,800,Sex Offenses,1.0,Rape,,,,,1/26/17,1/1/17
1,,17-X4757824,Property,Centennial,1,1/20/00,1/1/00,1615,Fraud Offenses,1.0,Identity Theft,,,,,1/20/17,1/1/17
2,200 BLOCK OF SE 78TH AVE,17-900367,Property,Montavilla,1,12/1/03,12/1/03,800,Fraud Offenses,1.0,False Pretenses/Swindle/Confidence Game,45.5207,-122.583,7668150.0,682825.0,1/9/17,1/1/17
3,,17-X4748982,Property,Southwest Hills,1,1/1/10,1/1/10,0,Fraud Offenses,1.0,Identity Theft,,,,,1/5/17,1/1/17
4,,17-X4748982,Property,Southwest Hills,1,1/1/10,1/1/10,0,Larceny Offenses,1.0,All Other Larceny,,,,,1/5/17,1/1/17


In [70]:
# look for missing values
crime_df.count()

Address              6456
Case Number          7060
Crime Against        7060
Neighborhood         6865
Number of Records    7060
Occur Date           7060
Occur Month Year     7060
Occur Time           7060
Offense Category     7060
Offense Count        7059
Offense Type         7059
Open Data Lat        6348
Open Data Lon        6348
Open Data X          6348
Open Data Y          6348
Report Date          7059
Report Month Year    7059
dtype: int64

In [71]:
# drop null rows
crime_df = crime_df.dropna(how='any')
crime_df.head()

Unnamed: 0,Address,Case Number,Crime Against,Neighborhood,Number of Records,Occur Date,Occur Month Year,Occur Time,Offense Category,Offense Count,Offense Type,Open Data Lat,Open Data Lon,Open Data X,Open Data Y,Report Date,Report Month Year
2,200 BLOCK OF SE 78TH AVE,17-900367,Property,Montavilla,1,12/1/03,12/1/03,800,Fraud Offenses,1.0,False Pretenses/Swindle/Confidence Game,45.5207,-122.583,7668150.0,682825.0,1/9/17,1/1/17
5,5400 BLOCK OF NE MALLORY AVE,17-900129,Property,King,1,11/28/10,11/1/10,1612,Fraud Offenses,1.0,Identity Theft,45.5625,-122.664,7647987.0,698581.0,1/3/17,1/1/17
6,5000 BLOCK OF NE 19TH AVE,17-901079,Property,Vernon,1,11/8/13,11/1/13,1200,Fraud Offenses,1.0,False Pretenses/Swindle/Confidence Game,45.5594,-122.646,7652567.0,697337.0,1/26/17,1/1/17
7,5000 BLOCK OF NE 19TH AVE,17-901079,Property,Vernon,1,11/8/13,11/1/13,1200,Fraud Offenses,1.0,Identity Theft,45.5594,-122.646,7652567.0,697337.0,1/26/17,1/1/17
8,12000 BLOCK OF SE PINE ST,17-900253,Property,Hazelwood,1,1/6/14,1/1/14,805,Fraud Offenses,1.0,Credit Card/ATM Fraud,45.5204,-122.539,7679522.0,682404.0,1/6/17,1/1/17


In [72]:
# verify counts
crime_df.count()

Address              6272
Case Number          6272
Crime Against        6272
Neighborhood         6272
Number of Records    6272
Occur Date           6272
Occur Month Year     6272
Occur Time           6272
Offense Category     6272
Offense Count        6272
Offense Type         6272
Open Data Lat        6272
Open Data Lon        6272
Open Data X          6272
Open Data Y          6272
Report Date          6272
Report Month Year    6272
dtype: int64

In [73]:
# Check to see if there are any misspelled or similar values in "Offense Type"
crime_df['Offense Type'].value_counts()

Theft From Motor Vehicle                       1157
Motor Vehicle Theft                             977
All Other Larceny                               751
Vandalism                                       559
Burglary                                        510
Shoplifting                                     460
Identity Theft                                  333
Simple Assault                                  198
Theft of Motor Vehicle Parts or Accessories     172
False Pretenses/Swindle/Confidence Game         166
Drug/Narcotic Violations                        150
Theft From Building                             148
Intimidation                                    148
Aggravated Assault                              131
Robbery                                         114
Counterfeiting/Forgery                          104
Credit Card/ATM Fraud                            38
Weapons Law Violations                           37
Prostitution                                     36
Arson       

In [75]:
# Combine similar offenses
crime_df['Offense Type'] = crime_df['Offense Type'].replace(
    {'Theft From Motor Vehicle': 'Motor Vehicle Theft', 'All Other Larceny': 'Burglary'})
print(crime_df['Offense Type'].value_counts())

Motor Vehicle Theft                            2134
Burglary                                       1261
Vandalism                                       559
Shoplifting                                     460
Identity Theft                                  333
Simple Assault                                  198
Theft of Motor Vehicle Parts or Accessories     172
False Pretenses/Swindle/Confidence Game         166
Drug/Narcotic Violations                        150
Theft From Building                             148
Intimidation                                    148
Aggravated Assault                              131
Robbery                                         114
Counterfeiting/Forgery                          104
Credit Card/ATM Fraud                            38
Weapons Law Violations                           37
Prostitution                                     36
Arson                                            19
Embezzlement                                     17
Purse-Snatch

In [77]:
# Create a new DataFrame that looks into a specific neighborhood
neighborhood = crime_df[crime_df["Neighborhood"] == "King"]
neighborhood.head()

# Reason to use loc over the above: the second argument of loc selects columns,
# so use loc when you also want to drop columns. The equivalent row filter is:
# neighborhood = crime_df.loc[crime_df["Neighborhood"] == "King"]

Unnamed: 0,Address,Case Number,Crime Against,Neighborhood,Number of Records,Occur Date,Occur Month Year,Occur Time,Offense Category,Offense Count,Offense Type,Open Data Lat,Open Data Lon,Open Data X,Open Data Y,Report Date,Report Month Year
5,5400 BLOCK OF NE MALLORY AVE,17-900129,Property,King,1,11/28/10,11/1/10,1612,Fraud Offenses,1.0,Identity Theft,45.5625,-122.664,7647987.0,698581.0,1/3/17,1/1/17
390,5500 BLOCK OF NE MARTIN LUTHER KING JR BLVD,17-9250,Property,King,1,12/10/16,12/1/16,0,Larceny Offenses,1.0,Burglary,45.5631,-122.661,7648603.0,698786.0,1/10/17,1/1/17
611,900 BLOCK OF NE ALBERTA ST,17-900046,Property,King,1,1/1/17,1/1/17,925,Larceny Offenses,1.0,Burglary,45.5592,-122.656,7649920.0,697323.0,1/2/17,1/1/17
708,NE 6TH AVE / NE BEECH ST,17-1842,Society,King,1,1/2/17,1/1/17,2144,Weapon Law Violations,1.0,Weapons Law Violations,45.5495,-122.66,7648922.0,693827.0,1/2/17,1/1/17
715,1400 BLOCK OF NE ALBERTA ST,17-900118,Property,King,1,1/1/17,1/1/17,1355,Vandalism,1.0,Vandalism,45.5591,-122.651,7651172.0,697285.0,1/3/17,1/1/17
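
To illustrate the point about `loc` and columns, here is a minimal sketch on made-up incident rows (not the Portland data) that filters rows and selects columns in a single call.

```python
import pandas as pd

# Hypothetical incidents for illustration only
incidents = pd.DataFrame(
    {
        "Neighborhood": ["King", "Vernon", "King"],
        "Offense Type": ["Burglary", "Identity Theft", "Vandalism"],
        "Occur Time": [925, 1200, 1355],
    }
)

# Row filter and column selection in a single .loc call
king_offenses = incidents.loc[
    incidents["Neighborhood"] == "King", ["Neighborhood", "Offense Type"]
]
```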


## Everyone Turn: Activity 5 Pandas Recap

In [78]:
# Import the Pandas library
import pandas as pd

In [79]:
# Create a reference to the CSV file desired
csv_path = "Resources/ufoSightings.csv"

# Read the CSV into a Pandas DataFrame
ufo_df = pd.read_csv(csv_path)

# Print the first five rows of data to the screen
ufo_df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.978333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.418056,-157.803611


In [80]:
# Check to see if there are any rows with missing data
ufo_df.count()

datetime                6465
city                    6464
state                   6124
country                 5760
shape                   6345
duration (seconds)      6464
duration (hours/min)    6464
comments                6463
date posted             6464
latitude                6464
longitude               6464
dtype: int64

In [81]:
# Remove the rows with missing data
clean_ufo_df = ufo_df.dropna(how="any")
clean_ufo_df.count()

datetime                5512
city                    5512
state                   5512
country                 5512
shape                   5512
duration (seconds)      5512
duration (hours/min)    5512
comments                5512
date posted             5512
latitude                5512
longitude               5512
dtype: int64

In [82]:
# Filter the data so that only those sightings in the US are in a DataFrame
usa_ufo_df = clean_ufo_df.loc[clean_ufo_df["country"] == "us", :]
usa_ufo_df

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111
3,10/10/1956 21:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.978333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.418056,-157.803611
5,10/10/1961 19:00,bristol,tn,us,sphere,300.0,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007,36.595000,-82.188889
7,10/10/1965 23:45,norwalk,ct,us,disk,1200.0,20 minutes,A bright orange color changing to reddish colo...,10/2/1999,41.117500,-73.408333
8,10/10/1966 20:00,pell city,al,us,disk,180.0,3 minutes,Strobe Lighted disk shape object observed clos...,3/19/2009,33.586111,-86.286111
9,10/10/1966 21:00,live oak,fl,us,disk,120.0,several minutes,Saucer zaps energy from powerline as my pregna...,5/11/2005,30.294722,-82.984167
10,10/10/1968 13:00,hawthorne,ca,us,circle,300.0,5 min.,ROUND &#44 ORANGE &#44 WITH WHAT I WOULD SAY W...,10/31/2003,33.916389,-118.351667
11,10/10/1968 19:00,brevard,nc,us,fireball,180.0,3 minutes,silent red /orange mass of energy floated by t...,6/12/2008,35.233333,-82.734444
12,10/10/1970 16:00,bellmore,ny,us,disk,1800.0,30 min.,silver disc seen by family and neighbors,5/11/2000,40.668611,-73.527500


In [83]:
# Count how many sightings have occurred within each state
state_counts = usa_ufo_df["state"].value_counts()
state_counts

ca    629
tx    353
wa    295
fl    276
ny    245
il    240
oh    203
pa    197
az    191
nc    175
co    138
mi    132
or    128
tn    124
mo    121
va    120
in    106
ga    103
sc     95
wi     95
ma     85
mn     79
nj     76
ky     75
md     74
nv     68
ct     67
ok     65
nm     60
ar     60
al     57
ia     57
ut     56
ks     55
me     44
nh     44
la     41
id     37
ms     35
mt     35
wv     33
ak     32
ne     27
hi     20
vt     18
ri     17
sd     17
wy     16
de     11
nd      6
pr      1
Name: state, dtype: int64

In [84]:
# Convert the state_counts Series into a DataFrame
state_ufo_counts_df = pd.DataFrame(state_counts)
state_ufo_counts_df.head()

Unnamed: 0,state
ca,629
tx,353
wa,295
fl,276
ny,245


In [85]:
# Rename the column to "Sum of Sightings"
state_ufo_counts_df = state_ufo_counts_df.rename(
    columns={"state": "Sum of Sightings"})
state_ufo_counts_df.head()

Unnamed: 0,Sum of Sightings
ca,629
tx,353
wa,295
fl,276
ny,245
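
The rename above depends on the counts column being called `state`. In pandas 2.0 and later, `value_counts()` names its result `count` instead, so a rename keyed on `"state"` would leave the column unchanged. A sketch that should not depend on that default, shown on made-up data, is to name the Series before converting it:

```python
import pandas as pd

# Hypothetical sightings for illustration only
sightings = pd.DataFrame({"state": ["ca", "tx", "ca", "wa", "ca", "tx"]})

state_counts = sightings["state"].value_counts()

# Name the Series, then convert it; the Series name becomes the column name
counts_df = state_counts.rename("Sum of Sightings").to_frame()
```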


In [86]:
# Want to add up the seconds UFOs are seen? The column must be numeric first
# Any problem can be seen by examining the datatypes within the DataFrame
usa_ufo_df.dtypes

datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)      float64
duration (hours/min)     object
comments                 object
date posted              object
latitude                float64
longitude               float64
dtype: object

In [87]:
# Using to_numeric() to convert a column's data into floats
usa_ufo_df["duration (seconds)"] = pd.to_numeric(
    usa_ufo_df["duration (seconds)"])
usa_ufo_df.dtypes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


datetime                 object
city                     object
state                    object
country                  object
shape                    object
duration (seconds)      float64
duration (hours/min)     object
comments                 object
date posted              object
latitude                float64
longitude               float64
dtype: object
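
The message in the output above is pandas' `SettingWithCopyWarning`: the DataFrame being modified was produced by filtering another one, so pandas cannot tell whether the assignment is meant to affect the original. A common way to avoid it, sketched here on made-up data, is to take an explicit `.copy()` of the filtered result before assigning to it.

```python
import pandas as pd

# Hypothetical sightings for illustration only
ufo = pd.DataFrame(
    {"country": ["us", "gb", "us"], "duration (seconds)": ["2700", "20", "900"]}
)

# .copy() makes the filtered result an independent DataFrame,
# so the later column assignment is unambiguous
usa = ufo.loc[ufo["country"] == "us", :].copy()
usa["duration (seconds)"] = pd.to_numeric(usa["duration (seconds)"])

total_seconds = usa["duration (seconds)"].sum()
```

The original `ufo` DataFrame is left untouched; only the copy has its column converted.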

In [88]:
# Now it is possible to find the sum of seconds
usa_ufo_df["duration (seconds)"].sum()

9021339.65

## Instructor Turn: Activity 6 Group By

In [89]:
# Import Dependencies
import pandas as pd

In [90]:
# Create a reference to the CSV file desired
csv_path = "Resources/ufoSightings.csv"

# Read the CSV into a Pandas DataFrame
ufo_df = pd.read_csv(csv_path)

# Print the first five rows of data to the screen
ufo_df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200.0,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20.0,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.978333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.418056,-157.803611


In [91]:
# Remove the rows with missing data
clean_ufo_df = ufo_df.dropna(how="any")
clean_ufo_df.count()

datetime                5512
city                    5512
state                   5512
country                 5512
shape                   5512
duration (seconds)      5512
duration (hours/min)    5512
comments                5512
date posted             5512
latitude                5512
longitude               5512
dtype: int64

In [92]:
# Converting the "duration (seconds)" column's values to numeric
clean_ufo_df["duration (seconds)"] = pd.to_numeric(
    clean_ufo_df["duration (seconds)"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [93]:
# Filter the data so that only those sightings in the US are in a DataFrame
usa_ufo_df = clean_ufo_df.loc[clean_ufo_df["country"] == "us", :]

usa_ufo_df.head()

Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700.0,45 minutes,This event took place in early fall around 194...,4/27/2004,29.883056,-97.941111
3,10/10/1956 21:00,edna,tx,us,circle,20.0,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.978333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900.0,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.418056,-157.803611
5,10/10/1961 19:00,bristol,tn,us,sphere,300.0,5 minutes,My father is now 89 my brother 52 the girl wit...,4/27/2007,36.595,-82.188889
7,10/10/1965 23:45,norwalk,ct,us,disk,1200.0,20 minutes,A bright orange color changing to reddish colo...,10/2/1999,41.1175,-73.408333


In [94]:
# Count how many sightings have occurred within each state
state_counts = usa_ufo_df["state"].value_counts()
state_counts.head()

ca    629
tx    353
wa    295
fl    276
ny    245
Name: state, dtype: int64

In [95]:
# Using GroupBy to split the data into groups according to "state" values
grouped_usa_df = usa_ufo_df.groupby(['state'])

# The object returned is a "GroupBy" object and cannot be viewed normally...
print(grouped_usa_df)

# To view the grouped data, an aggregation method such as count() must be applied...
grouped_usa_df.count().head(10)

<pandas.core.groupby.DataFrameGroupBy object at 0x00000000094D66D8>


Unnamed: 0_level_0,datetime,city,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
ak,32,32,32,32,32,32,32,32,32,32
al,57,57,57,57,57,57,57,57,57,57
ar,60,60,60,60,60,60,60,60,60,60
az,191,191,191,191,191,191,191,191,191,191
ca,629,629,629,629,629,629,629,629,629,629
co,138,138,138,138,138,138,138,138,138,138
ct,67,67,67,67,67,67,67,67,67,67
de,11,11,11,11,11,11,11,11,11,11
fl,276,276,276,276,276,276,276,276,276,276
ga,103,103,103,103,103,103,103,103,103,103
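Every column in the table above shows the same number because `count()` reports non-null values per column, and this data appears to have no missing values left. When only the number of rows per group is needed, `size()` is a more direct option. A minimal sketch on made-up data (the `sightings` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing "shape" value
sightings = pd.DataFrame({
    "state": ["tx", "tx", "ca", "ca", "ca"],
    "shape": ["disk", None, "light", "circle", "sphere"],
})

grouped = sightings.groupby("state")

# count() reports non-null values per column, so the missing shape lowers the tx count
per_column = grouped.count()

# size() reports the number of rows per group, regardless of missing values
per_group = grouped.size()

print(per_column)
print(per_group)
```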


In [96]:
# Since "duration (seconds)" was converted to a numeric type, it can now be summed up per state
state_duration = grouped_usa_df["duration (seconds)"].sum()
state_duration.head()

state
ak     67915.0
al     40713.0
ar    233053.0
az    365206.1
ca    460920.1
Name: duration (seconds), dtype: float64

In [97]:
# Creating a new DataFrame using both duration and count
state_summary_table = pd.DataFrame({"Number of Sightings": state_counts,
                                    "Total Visit Time": state_duration})
state_summary_table.head()

Unnamed: 0,Number of Sightings,Total Visit Time
ak,32,67915.0
al,57,40713.0
ar,60,233053.0
az,191,365206.1
ca,629,460920.1
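Building the table from two Series works because both are indexed by state, so pandas aligns them on those labels. As an alternative sketch, the same kind of summary can be produced in one step with named aggregation on the GroupBy object. The `sightings` frame below is made-up toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for the US sightings
sightings = pd.DataFrame({
    "state": ["tx", "tx", "ca"],
    "duration (seconds)": [10.0, 20.0, 5.0],
})

# Each keyword is an output column name; each value is (source column, aggregation).
# Dictionary unpacking is used because the output names contain spaces.
state_summary = sightings.groupby("state").agg(**{
    "Number of Sightings": ("duration (seconds)", "count"),
    "Total Visit Time": ("duration (seconds)", "sum"),
})
print(state_summary)
```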


In [98]:
# It is also possible to group a DataFrame by multiple columns
# This returns an object with a multi-level index (MultiIndex), however, which can be harder to work with
grouped_international_data = clean_ufo_df.groupby(['country', 'state'])

# Converting a GroupBy object into a DataFrame
international_duration = pd.DataFrame(
    grouped_international_data["duration (seconds)"].sum())
international_duration.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,duration (seconds)
country,state,Unnamed: 2_level_1
au,nt,60.0
au,wa,30.0
ca,ab,5137.0
ca,bc,25612.0
ca,mb,42150.0
ca,nb,8940.0
ca,nf,7205.0
ca,ns,645.0
ca,on,68052.0
ca,pq,4397.0
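A multi-level index like the one above can be queried with `.loc`, or flattened back into ordinary columns with `reset_index()`. The following is a small sketch on made-up data (the `ufo` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with two grouping columns
ufo = pd.DataFrame({
    "country": ["au", "ca", "ca", "ca"],
    "state": ["nt", "ab", "on", "on"],
    "duration (seconds)": [60.0, 100.0, 30.0, 20.0],
})

totals = ufo.groupby(["country", "state"])[["duration (seconds)"]].sum()

# Select every state within one country using the outer index level
canada = totals.loc["ca"]

# Select a single (country, state) pair with a tuple
ontario_total = totals.loc[("ca", "on"), "duration (seconds)"]

# Flatten the MultiIndex back into ordinary columns
flat = totals.reset_index()
print(flat)
```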


## Students Work in Pairs Turn: Activity 7 Building a Pokédex

### Instructions

  * Read the Pokemon CSV file with Pandas.

  * Create a new table by extracting the following columns: "Type 1", "HP", "Attack", "Sp. Atk", "Sp. Def", and "Speed".

  * Find the average stats for each type of Pokemon.

  * Create a new DataFrame out of the averages.

  * Calculate the total power level of each type of Pokemon by adding all of the previous stats together and place the results into a new column.

### Bonus

  * Sort the table by strongest type and export the resulting table to a new CSV

In [99]:
# Dependencies
import pandas as pd
import numpy as np

In [100]:
# Save the path of the file to load to a variable
pokemon = "Resources/Pokemon.csv"

In [101]:
# Read with Pandas
pokemon_df = pd.read_csv(pokemon)

pokemon_df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [102]:
# Extract the following columns: "Type 1", "HP", "Attack", "Sp. Atk", "Sp. Def", and "Speed"
pokemon_extract_df = pokemon_df[["Type 1", "HP", "Attack", "Sp. Atk", "Sp. Def", "Speed"]]
pokemon_extract_df.head()

Unnamed: 0,Type 1,HP,Attack,Sp. Atk,Sp. Def,Speed
0,Grass,45,49,65,65,45
1,Grass,60,62,80,80,60
2,Grass,80,82,100,100,80
3,Grass,80,100,122,120,80
4,Fire,39,52,60,50,65


In [108]:
# Create the GroupBy object based on the "Type 1" column

pokemon_type = pokemon_extract_df.groupby(['Type 1'])

# Calculate averages for combat stats using the .mean() method
# Selecting several columns from a GroupBy object requires a list (double brackets)
pokemon_type_df = pd.DataFrame(
    pokemon_type[["HP", "Attack", "Sp. Atk", "Sp. Def", "Speed"]].mean())

pokemon_type_df

Unnamed: 0_level_0,HP,Attack,Sp. Atk,Sp. Def,Speed
Type 1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Bug,56.884058,70.971014,53.869565,64.797101,61.681159
Dark,66.806452,88.387097,74.645161,69.516129,76.16129
Dragon,83.3125,112.125,96.84375,88.84375,83.03125
Electric,59.795455,69.090909,90.022727,73.704545,84.5
Fairy,74.117647,61.529412,78.529412,84.705882,48.588235
Fighting,69.851852,96.777778,53.111111,64.703704,66.074074
Fire,69.903846,84.769231,88.980769,72.211538,74.442308
Flying,70.75,78.75,94.25,72.5,102.5
Ghost,64.4375,73.78125,79.34375,76.46875,64.34375
Grass,67.271429,73.214286,77.5,70.428571,61.928571


In [112]:
# Add the five average stats together to get each type's total power level
pokemon_type_df["Total"] = pokemon_type_df["HP"] + pokemon_type_df["Attack"] + \
    pokemon_type_df["Sp. Atk"] + pokemon_type_df["Sp. Def"] + pokemon_type_df["Speed"]

pokemon_type_df["Total"]

Type 1
Bug         308.202899
Dark        375.516129
Dragon      464.156250
Electric    377.113636
Fairy       347.470588
Fighting    350.518519
Fire        390.307692
Flying      418.750000
Ghost       358.375000
Grass       350.342857
Ground      352.656250
Ice         362.041667
Normal      341.836735
Poison      330.321429
Psychic     408.263158
Rock        352.954545
Steel       361.333333
Water       357.508929
Name: Total, dtype: float64
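Adding the columns one by one works, but it gets long as the number of stats grows. An equivalent sketch selects the stat columns and sums across each row with `sum(axis=1)`. The two rows below reuse the Dragon and Flying averages shown earlier; the `averages` and `stat_columns` names are illustrative:

```python
import pandas as pd

stat_columns = ["HP", "Attack", "Sp. Atk", "Sp. Def", "Speed"]

# Two rows of type averages, taken from the table above
averages = pd.DataFrame(
    {
        "HP": [83.3125, 70.75],
        "Attack": [112.125, 78.75],
        "Sp. Atk": [96.84375, 94.25],
        "Sp. Def": [88.84375, 72.5],
        "Speed": [83.03125, 102.5],
    },
    index=pd.Index(["Dragon", "Flying"], name="Type 1"),
)

# axis=1 sums across the columns of each row
averages["Total"] = averages[stat_columns].sum(axis=1)
print(averages["Total"])
```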

In [116]:
# Bonus: Sort the table by strongest type and export the resulting table to a new CSV.
# sort_values() has no "descending" parameter; pass ascending=False to sort from highest to lowest
strongest = pokemon_type_df.sort_values(["Total"], ascending=False)
strongest.reset_index(inplace=True)  # inplace=True modifies strongest directly instead of returning a new DataFrame
strongest

Unnamed: 0,Type 1,HP,Attack,Sp. Atk,Sp. Def,Speed,Total
0,Dragon,83.3125,112.125,96.84375,88.84375,83.03125,464.15625
1,Flying,70.75,78.75,94.25,72.5,102.5,418.75
2,Psychic,70.631579,71.45614,98.403509,86.280702,81.491228,408.263158
3,Fire,69.903846,84.769231,88.980769,72.211538,74.442308,390.307692
4,Electric,59.795455,69.090909,90.022727,73.704545,84.5,377.113636
5,Dark,66.806452,88.387097,74.645161,69.516129,76.16129,375.516129
6,Ice,72.0,72.75,77.541667,76.291667,63.458333,362.041667
7,Steel,65.222222,92.703704,67.518519,80.62963,55.259259,361.333333
8,Ghost,64.4375,73.78125,79.34375,76.46875,64.34375,358.375
9,Water,72.0625,74.151786,74.8125,70.517857,65.964286,357.508929
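The bonus also asks for the sorted table to be exported to a new CSV, which the cell above does not do. A minimal sketch of that step with `to_csv` follows. The file name `pokemon_by_type.csv` and the temporary directory are assumptions made for illustration, and the three rows are a subset of the totals shown above:

```python
import os
import tempfile

import pandas as pd

# A few of the type totals from the table above
type_totals = pd.DataFrame({
    "Type 1": ["Fire", "Dragon", "Flying"],
    "Total": [390.307692, 464.15625, 418.75],
})

strongest = type_totals.sort_values("Total", ascending=False).reset_index(drop=True)

# Hypothetical output location; in the activity this would be a path of your choosing
output_path = os.path.join(tempfile.mkdtemp(), "pokemon_by_type.csv")

# index=False leaves the 0, 1, 2... row labels out of the file
strongest.to_csv(output_path, index=False)

# Read the file back to confirm what was written
round_trip = pd.read_csv(output_path)
print(round_trip)
```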


## Instructor Turn: Activity 8 Sorting

In [117]:
# Import Dependencies
import pandas as pd

In [118]:
csv_path = "Resources/Happiness_2017.csv"
happiness_df = pd.read_csv(csv_path)
happiness_df.head()

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715
3,Switzerland,4,7.494,7.561772,7.426227,1.56498,1.516912,0.858131,0.620071,0.290549,0.367007,2.276716
4,Finland,5,7.469,7.527542,7.410458,1.443572,1.540247,0.809158,0.617951,0.245483,0.382612,2.430182


In [119]:
# Sorting the DataFrame based on "Freedom" column
# Will sort from lowest to highest if no other parameter is passed
freedom_df = happiness_df.sort_values("Freedom")
freedom_df.head()

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
139,Angola,140,3.795,3.951642,3.638358,0.858428,1.104412,0.049869,0.0,0.097926,0.06972,1.614482
129,Sudan,130,4.139,4.345747,3.932253,0.659517,1.214009,0.290921,0.014996,0.182317,0.089848,1.687066
144,Haiti,145,3.603,3.734715,3.471285,0.36861,0.64045,0.277321,0.03037,0.489204,0.099872,1.697168
153,Burundi,154,2.905,3.07469,2.73531,0.091623,0.629794,0.151611,0.059901,0.204435,0.084148,1.683024
151,Syria,152,3.462,3.663669,3.260331,0.777153,0.396103,0.500533,0.081539,0.493664,0.151347,1.061574


In [120]:
# To sort from highest to lowest, ascending=False must be passed in
freedom_df = happiness_df.sort_values("Freedom", ascending=False)
freedom_df.head()

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
46,Uzbekistan,47,5.971,6.065538,5.876463,0.786441,1.548969,0.498273,0.658249,0.415984,0.246528,1.816914
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
128,Cambodia,129,4.168,4.278518,4.057483,0.601765,1.006238,0.429783,0.633376,0.385923,0.068106,1.042941
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707


In [121]:
# It is possible to sort based upon multiple columns
family_and_generosity = happiness_df.sort_values(
    ["Family", "Generosity"], ascending=False)
family_and_generosity.head()

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715
14,Ireland,15,6.977,7.043352,6.910649,1.535707,1.558231,0.809783,0.57311,0.427858,0.298388,1.773869
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
46,Uzbekistan,47,5.971,6.065538,5.876463,0.786441,1.548969,0.498273,0.658249,0.415984,0.246528,1.816914
7,New Zealand,8,7.314,7.37951,7.24849,1.405706,1.548195,0.81676,0.614062,0.500005,0.382817,2.046456
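With multiple sort columns, the second column only breaks ties in the first, and a single `ascending=False` applies to every column. Passing a list to `ascending` sets the direction per column. A small sketch on made-up scores (the `scores` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with a tie in "Family"
scores = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Family": [1.5, 1.5, 1.2, 1.6],
    "Generosity": [0.3, 0.4, 0.5, 0.1],
})

# Both columns sorted from highest to lowest
both_descending = scores.sort_values(["Family", "Generosity"], ascending=False)

# Family from highest to lowest, ties broken by the lowest Generosity first
mixed = scores.sort_values(["Family", "Generosity"], ascending=[False, True])

print(both_descending)
print(mixed)
```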


In [123]:
# The index can be reset to provide index numbers based on the new rankings.
new_index = family_and_generosity.reset_index(drop=True)
new_index.head()

Unnamed: 0,Country,Happiness.Rank,Happiness.Score,Whisker.high,Whisker.low,Economy..GDP.per.Capita.,Family,Health..Life.Expectancy.,Freedom,Generosity,Trust..Government.Corruption.,Dystopia.Residual
0,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715
1,Ireland,15,6.977,7.043352,6.910649,1.535707,1.558231,0.809783,0.57311,0.427858,0.298388,1.773869
2,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
3,Uzbekistan,47,5.971,6.065538,5.876463,0.786441,1.548969,0.498273,0.658249,0.415984,0.246528,1.816914
4,New Zealand,8,7.314,7.37951,7.24849,1.405706,1.548195,0.81676,0.614062,0.500005,0.382817,2.046456
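The `drop=True` argument matters here: without it, `reset_index()` keeps the old row labels as a new column named `index`. The sketch below shows both behaviors on three rows taken from the happiness data (the `ranked` name is illustrative):

```python
import pandas as pd

ranked = pd.DataFrame({
    "Country": ["Norway", "Denmark", "Iceland"],
    "Family": [1.533524, 1.551122, 1.610574],
}).sort_values("Family", ascending=False)

# Default: the old labels (2, 1, 0) are preserved in an "index" column
kept = ranked.reset_index()

# drop=True: the old labels are discarded
dropped = ranked.reset_index(drop=True)

print(kept)
print(dropped)
```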


## Students Turn: Activity 9 Search For The Worst

### Instructions

  * Read in the CSV file provided and print it to the screen.

  * Print out a list of all of the values within the "Preferred Position" column.

  * Select a value from this list and create a new DataFrame that only includes players who prefer that position.

  * Sort the DataFrame based upon a player's skill in that position.

  * Reset the index for the DataFrame so that the index is in order.

  * Print out the statistics for the worst player in a position to the screen.

In [124]:
# Import Dependencies
import pandas as pd
import numpy as np

In [131]:
# Create reference to CSV file
csv_path = "Resources/Soccer2018Data.csv"

# Import the CSV into a pandas DataFrame
soccer_df = pd.read_csv(csv_path)
soccer_df.head()

Unnamed: 0,Name,Age,Nationality,Overall,Potential,Club,Preferred Position,CAM,CB,CDM,...,RB,RCB,RCM,RDM,RF,RM,RS,RW,RWB,ST
0,Cristiano Ronaldo,32,Portugal,94,94,Real Madrid CF,ST,89.0,53.0,62.0,...,61.0,53.0,82.0,62.0,91.0,89.0,92.0,91.0,66.0,92.0
1,L. Messi,30,Argentina,93,93,FC Barcelona,RW,92.0,45.0,59.0,...,57.0,45.0,84.0,59.0,92.0,90.0,88.0,91.0,62.0,88.0
2,Neymar,25,Brazil,92,94,Paris Saint-Germain,LW,88.0,46.0,59.0,...,59.0,46.0,79.0,59.0,88.0,87.0,84.0,89.0,64.0,84.0
3,L. Suárez,30,Uruguay,92,92,FC Barcelona,ST,87.0,58.0,65.0,...,64.0,58.0,80.0,65.0,88.0,85.0,88.0,87.0,68.0,88.0
4,M. Neuer,31,Germany,92,92,FC Bayern Munich,GK,,,,...,,,,,,,,,,


In [132]:
# Collect a list of all the unique values in "Preferred Position"
soccer_df["Preferred Position"].unique()

array(['ST', 'RW', 'LW', 'GK', 'CDM', 'CB', 'RM', 'CM', 'LM', 'LB', 'CAM',
       'RB', 'CF', 'RWB', 'LWB', nan], dtype=object)
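The array above includes `nan` because at least one player has no preferred position recorded. If only real position codes are wanted, one option is to drop missing values before calling `unique()`. A minimal sketch on a made-up Series (the `positions` values are hypothetical):

```python
import pandas as pd

# Hypothetical toy column with one missing entry
positions = pd.Series(["ST", "RW", None, "ST"], name="Preferred Position")

# dropna() removes missing entries, so unique() returns only real codes
known_positions = positions.dropna().unique()
print(known_positions)
```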

In [137]:
# Looking only at strikers (ST) to start
strikers = soccer_df[["Name","ST"]]
strikers.head()

Unnamed: 0,Name,ST
0,Cristiano Ronaldo,92.0
1,L. Messi,88.0
2,Neymar,84.0
3,L. Suárez,88.0
4,M. Neuer,


In [151]:
# Sort the DataFrame by the values in the "ST" column to find the worst
striker_order = soccer_df.sort_values("ST")
striker_order.head()
# Reset the index so that the index is now based on the sorting locations
new_index = striker_order.reset_index(drop=True)
new_index.head()

Unnamed: 0,Name,Age,Nationality,Overall,Potential,Club,Preferred Position,CAM,CB,CDM,...,RB,RCB,RCM,RDM,RF,RM,RS,RW,RWB,ST
0,Oh Ban Suk,29,Korea Republic,70,70,Jeju United FC,CB,35.0,69.0,59.0,...,58.0,69.0,42.0,59.0,34.0,36.0,38.0,33.0,54.0,38.0
1,J. Vergara,23,Colombia,70,80,Milan,CB,39.0,69.0,61.0,...,60.0,69.0,45.0,61.0,37.0,41.0,38.0,38.0,57.0,38.0
2,F. Poli,28,Italy,69,70,Carpi,CB,35.0,68.0,57.0,...,60.0,68.0,40.0,57.0,34.0,37.0,38.0,35.0,55.0,38.0
3,D. Maietta,34,Italy,74,74,Bologna,CB,41.0,73.0,65.0,...,62.0,73.0,49.0,65.0,39.0,42.0,39.0,39.0,59.0,39.0
4,N. Cherubin,30,Italy,70,70,Hellas Verona,CB,41.0,69.0,62.0,...,60.0,69.0,48.0,62.0,39.0,42.0,40.0,39.0,57.0,40.0
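The sort above covers every player, which is why the lowest "ST" ratings belong to players whose preferred position is CB. To follow the instruction of looking only at players who prefer the chosen position, the DataFrame can be filtered first. This is a sketch on made-up data (the `players` frame and its names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the soccer file
players = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Preferred Position": ["ST", "CB", "ST", "GK"],
    "ST": [92.0, 38.0, 61.0, None],
})

# Keep only players whose preferred position is striker
strikers_only = players.loc[players["Preferred Position"] == "ST", :]

# Sort from lowest to highest rating and renumber the rows
ranked = strikers_only.sort_values("ST").reset_index(drop=True)

# The first row is now the lowest-rated player who prefers that position
worst = ranked.iloc[0]
print(worst)
```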


In [156]:
# Save all of the information collected on the worst striker
worst = new_index.iloc[0]
print(worst)

Name                      Oh Ban Suk
Age                               29
Nationality           Korea Republic
Overall                           70
Potential                         70
Club                  Jeju United FC
Preferred Position                CB
CAM                               35
CB                                69
CDM                               59
CF                                34
CM                                42
LAM                               35
LB                                58
LCB                               69
LCM                               42
LDM                               59
LF                                34
LM                                36
LS                                38
LW                                33
LWB                               54
RAM                               35
RB                                58
RCB                               69
RCM                               42
RDM                               59
R