# Week 6: Load & Wrangle Data

## Working with File Objects

Before we move on to different file formats, we should first understand how to read from and write to a file in Python.

In [3]:
## let's use the built-in open function to create the file some_file.txt
f = open('some_file.txt', 'w')
## note that we can access some attributes of f, like name and mode
print(f.name, f.mode)
f.write('test\n')
f.write('test\ntest')
## don't forget to close the file
f.close()

some_file.txt w

In [4]:
!cat some_file.txt

test
test
test


Instead of closing a file manually each time, we can open it in a `with` block, which closes the file automatically when the block ends.

- Before running the next cell, we can manually add a few more lines to some_file.txt.

In [5]:
## better practice: open the file in a context manager
## open some_file.txt in reading mode
with open('some_file.txt', 'r') as file:
    ## read one line from the file and assign it to line1
    line1 = file.readline()
    # file.seek(0)
    ## read another line
    line2 = file.readline()

In [6]:
## Let's print line1 and line2
print(line1)
print(line2)

test

test



In [7]:
with open('some_file.txt', 'r') as file:
    lines = file.readlines()
    for line in lines:
        print(line)

test

test

test


In [8]:
with open('some_file.txt', 'r') as file:
    while(True):
        line = file.readline()
        
        if not line:
            break
        
        # processing logic
        print(line)

test

test

test
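
The blank lines in the outputs above appear because every line read from the file still ends with `'\n'`, and `print` adds a second newline. A minimal sketch of one way around this, assuming a throwaway file in a temporary directory (the name `demo.txt` is made up for this example), loops over the file object directly and strips the line ending:

```python
import os
import tempfile

# Hypothetical demo file in a temporary directory, so nothing in the
# working directory is touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as f:
    f.write('test\ntest\ntest')

stripped_lines = []
with open(demo_path, 'r') as f:
    # A file object is an iterator over its lines, so we can loop over it
    # without calling readline() or readlines().
    for line in f:
        # rstrip('\n') drops the trailing newline that each line keeps.
        stripped_lines.append(line.rstrip('\n'))

print(stripped_lines)
```

Looping over the file object reads one line at a time, so it also avoids holding the whole file in memory the way `readlines()` does.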


In [9]:
## Sometimes a file might be too big to read all at once,
## so it can be better to read a limited number of characters at a time.
## open some_file.txt again
with open('some_file.txt', 'r') as text_file:
    ## use the read method with 10 characters and store the result in a variable named 'content'
    content = text_file.read(10)
    ## print content
    print(content)

test
test
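
The cell above reads only the first 10 characters. To process a whole file in pieces, we can keep calling `read` with a size until it returns an empty string. The following is a sketch under assumptions: the demo file, its name, and the chunk size of 10 are all chosen just for illustration.

```python
import os
import tempfile

# Hypothetical demo file with the same content as some_file.txt above.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as f:
    f.write('test\ntest\ntest')

chunk_size = 10  # characters per read; an arbitrary choice for this sketch
chunks = []
with open(demo_path, 'r') as f:
    while True:
        chunk = f.read(chunk_size)
        # read() returns an empty string once the end of the file is reached
        if not chunk:
            break
        chunks.append(chunk)

print(chunks)
```

Joining the chunks back together gives the original content, so no characters are lost between reads.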



In [14]:
## In a similar way, we can write to files
## open a new file and write two greetings to it
with open('some_other_file.txt', 'w') as w_file:
    w_file.write('Hello Class!')
    w_file.write('Hello Online Session!')


In [15]:
!cat some_other_file.txt

Hello Class!Hello Online Session!
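
The two greetings end up on one line because `write` does not add a line ending on its own. A small sketch, assuming a made-up file name `greetings.txt` in a temporary directory, adds `'\n'` explicitly and also shows append mode (`'a'`), which keeps the existing content instead of overwriting it:

```python
import os
import tempfile

# Hypothetical file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'greetings.txt')

# 'w' creates the file (or truncates it if it already exists).
with open(demo_path, 'w') as f:
    f.write('Hello Class!\n')

# 'a' opens the file for appending, so the first line is kept.
with open(demo_path, 'a') as f:
    f.write('Hello Online Session!\n')

with open(demo_path, 'r') as f:
    # splitlines() splits on line endings and drops them
    greetings = f.read().splitlines()

print(greetings)
```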

## Working with CSV files

As we discussed in the first part of the lectures, `csv` files are one of the most common data formats used in data science.

__Before We Start__

- Let's create some random student first names, last names, and emails.

__Writing a csv file__

In [23]:
students = """John,Doe,john-doe@umbc.edu
Marcellus,Till,marcellus-till@umbc.edu
Glad,Baker,glad-baker@umbc.edu
Duncan,Veyne,duncan-veyne@umbc.edu"""

## Open a file in writing mode and call the file object csvfile
with open('my_csv_file.csv', 'w') as csvfile:
    csvfile.write(students)

In [24]:
!cat my_csv_file.csv

John,Doe,john-doe@umbc.edu
Marcellus,Till,marcellus-till@umbc.edu
Glad,Baker,glad-baker@umbc.edu
Duncan,Veyne,duncan-veyne@umbc.edu

In [38]:
import pandas as pd
pd.read_csv('my_csv_file.csv')

Unnamed: 0,John,"Doe, Jr""",john-doe@umbc.edu
0,Marcellus,Till,marcellus-till@umbc.edu
1,Glad,Baker,glad-baker@umbc.edu
2,Duncan,Veyne,duncan-veyne@umbc.edu


In [39]:
students = [
 ['John', 'Doe, Jr', 'john-doe@umbc.edu'],
 ['Marcellus', 'Till', 'marcellus-till@umbc.edu'],
 ['Glad', 'Baker', 'glad-baker@umbc.edu'],
 ['Duncan', 'Veyne', 'duncan-veyne@umbc.edu']
 ]

In [40]:
import csv

In [41]:
## Open a file in writing mode and call the file object csvfile
with open('my_csv_file.csv', 'w') as csvfile:
  ## instantiate the csv.writer with csvfile
  csv_writer = csv.writer(csvfile)

  ## go through each row in students
  for row in students:
    ## use writerow method to write each row.
    csv_writer.writerow(row)

In [42]:
!cat my_csv_file.csv

John,"Doe, Jr",john-doe@umbc.edu
Marcellus,Till,marcellus-till@umbc.edu
Glad,Baker,glad-baker@umbc.edu
Duncan,Veyne,duncan-veyne@umbc.edu
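
Notice that the writer put quotes around `Doe, Jr` because the value contains a comma. As a hedged sketch of the same idea, the example below writes all rows at once with `writerows` and reads them back to check that the quoted field survives the round trip. The file name `students_demo.csv` and the shortened student list are assumptions for this example. Passing `newline=''` to `open` is what the `csv` module documentation recommends for files used with `csv.reader` and `csv.writer`.

```python
import csv
import os
import tempfile

# Shortened copy of the students list used above.
students = [
    ['John', 'Doe, Jr', 'john-doe@umbc.edu'],
    ['Glad', 'Baker', 'glad-baker@umbc.edu'],
]

# Hypothetical file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'students_demo.csv')

# newline='' lets the csv module control line endings itself.
with open(demo_path, 'w', newline='') as csvfile:
    # writerows writes every row in one call
    csv.writer(csvfile).writerows(students)

with open(demo_path, 'r', newline='') as csvfile:
    # csv.reader understands the quotes, so 'Doe, Jr' stays one field
    rows_back = list(csv.reader(csvfile))

print(rows_back)
```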


__Reading from a csv file__

In [43]:
with open('my_csv_file.csv', 'r') as file:
    while(True):
        line = file.readline()
        
        if not line:
            break
        
        
#         print(line)
        line = line.replace('\n', '')
        print(line.split(','))
        

['John', '"Doe', ' Jr"', 'john-doe@umbc.edu']
['Marcellus', 'Till', 'marcellus-till@umbc.edu']
['Glad', 'Baker', 'glad-baker@umbc.edu']
['Duncan', 'Veyne', 'duncan-veyne@umbc.edu']


In [44]:
## open the file my_csv_file.csv in reading mode
with open('my_csv_file.csv', mode = 'r') as csvfile: 
  ## instantiate csv.reader as reader
  reader = csv.reader(csvfile)
  ## go over reader line by line
  for line in reader:
    ## print each line
    print(line)


['John', 'Doe, Jr', 'john-doe@umbc.edu']
['Marcellus', 'Till', 'marcellus-till@umbc.edu']
['Glad', 'Baker', 'glad-baker@umbc.edu']
['Duncan', 'Veyne', 'duncan-veyne@umbc.edu']


We can also convert dictionaries to csv files, or vice versa.

In [45]:
## Keep this cell but explain each line

## open a new file in writing mode 
with open('names.csv', 'w') as csvfile:
    ## create a list of fieldnames: ['first_name', 'last_name']
    fieldnames = ['first_name', 'last_name']
    ## instantiate csv.DictWriter with file and fieldnames as writer
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    ## call writeheader method from writer object
    writer.writeheader()
    
    ## enter each row
    writer.writerow({'first_name': 'Baked', 'last_name': 'Beans'})
    writer.writerow({'first_name': 'Lovely', 'last_name': 'Spam'})
    writer.writerow({'first_name': 'Wonderful', 'last_name': 'Spam'})

In [47]:
!cat names.csv

first_name,last_name
Baked,Beans
Lovely,Spam
Wonderful,Spam


__Your Turn__

- We created a new .csv file above named names.csv
- Use what we learned to read this file line by line.

In [48]:
with open('names.csv', 'r') as f:
    lines = f.readlines()
    for line in lines:
        print(line)

first_name,last_name

Baked,Beans

Lovely,Spam

Wonderful,Spam



Similarly, we could read this file into a list of Python dictionaries.

In [49]:
lines = []
## open the file names.csv in reading mode
with open('names.csv', mode = 'r') as csvfile:
    ## instantiate csv.DictReader as reader, supplying the field names explicitly
    reader = csv.DictReader(csvfile, fieldnames= ['first_name', 'last_name'])
    ## skip the header line, since the field names were given above
    csvfile.readline()
    ## go over reader line by line
    for line in reader:
        lines.append(line)
print(lines)

[{'first_name': 'Baked', 'last_name': 'Beans'}, {'first_name': 'Lovely', 'last_name': 'Spam'}, {'first_name': 'Wonderful', 'last_name': 'Spam'}]
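
When the file already has a header row, as names.csv does, `csv.DictReader` can take the field names from that row if we leave out the `fieldnames` argument, so no manual `readline()` is needed. A minimal sketch, using an in-memory string (made up to mirror the first rows of names.csv) instead of a real file:

```python
import csv
import io

# In-memory stand-in for a csv file with a header row.
csv_text = 'first_name,last_name\nBaked,Beans\nLovely,Spam\n'

# Without fieldnames, DictReader uses the first row as the keys.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

print(reader.fieldnames)
print(records)
```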


## Working with JSON files

[Data Source](https://github.com/jackiekazil/data-wrangling/blob/master/data/chp3/data-text.json)

In [52]:
## import json library

import json

In [53]:
!ls

homework                            some_other_file.txt
labs                                week6 load&wrangling data.ipynb
my_csv_file.csv                     week6-loading&wrangling data.pdf
names.csv                           week6-loading&wrangling data.pptx
some_file.txt                       ~$week6-loading&wrangling data.pptx


In [54]:
!curl https://raw.githubusercontent.com/jackiekazil/data-wrangling/master/data/chp3/data-text.json --output data-text.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1607k  100 1607k    0     0  4078k      0 --:--:-- --:--:-- --:--:-- 4068k


In [56]:
# !cat data-text.json

In [57]:
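from urllib import request

def download_file(file_name, url):
    ## fetch the url and write the raw bytes to file_name;
    ## the with block closes both the response and the file
    with request.urlopen(url) as res, open(file_name, 'wb') as file:
        file.write(res.read())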

In [58]:
download_file('data-text.json', 'https://raw.githubusercontent.com/jackiekazil/data-wrangling/master/data/chp3/data-text.json')
download_file('data-text.csv' , 'https://raw.githubusercontent.com/jackiekazil/data-wrangling/master/data/chp3/data-text.csv')
download_file('separate_jsons.txt' , 'https://raw.githubusercontent.com/msaricaumbc/DS_data/master/separate_jsons.txt')

In [59]:
!ls

data-text.csv                       some_file.txt
data-text.json                      some_other_file.txt
homework                            week6 load&wrangling data.ipynb
labs                                week6-loading&wrangling data.pdf
my_csv_file.csv                     week6-loading&wrangling data.pptx
names.csv                           ~$week6-loading&wrangling data.pptx
separate_jsons.txt


In [60]:
path = 'data-text.json'

## use path to open the file in reading mode
with open(path, 'r') as file:
  ## parse the file contents into a Python object with json.load
  json_file = json.load(file)
  ## go row by row through the parsed data and print each row
  for row in json_file:
    print(row)

{'Indicator': 'Life expectancy at birth (years)', 'PUBLISH STATES': 'Published', 'Year': 1990, 'WHO region': 'Europe', 'World Bank income group': 'High-income', 'Country': 'Andorra', 'Sex': 'Both sexes', 'Display Value': 77, 'Numeric': 77.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Life expectancy at birth (years)', 'PUBLISH STATES': 'Published', 'Year': 2000, 'WHO region': 'Europe', 'World Bank income group': 'High-income', 'Country': 'Andorra', 'Sex': 'Both sexes', 'Display Value': 80, 'Numeric': 80.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Life expectancy at age 60 (years)', 'PUBLISH STATES': 'Published', 'Year': 2012, 'WHO region': 'Europe', 'World Bank income group': 'High-income', 'Country': 'Andorra', 'Sex': 'Female', 'Display Value': 28, 'Numeric': 28.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Life expectancy at age 60 (years)', 'PUBLISH STATES': 'Published', 'Year': 2000, 'WHO region': 'Europe', 'World Bank income group': 'High-income'

{'Indicator': 'Life expectancy at birth (years)', 'PUBLISH STATES': 'Published', 'Year': 1990, 'WHO region': 'Africa', 'World Bank income group': 'Lower-middle-income', 'Country': 'Cameroon', 'Sex': 'Female', 'Display Value': 56, 'Numeric': 56.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Life expectancy at age 60 (years)', 'PUBLISH STATES': 'Published', 'Year': 2012, 'WHO region': 'Africa', 'World Bank income group': 'Lower-middle-income', 'Country': 'Cameroon', 'Sex': 'Male', 'Display Value': 16, 'Numeric': 16.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Life expectancy at age 60 (years)', 'PUBLISH STATES': 'Published', 'Year': 1990, 'WHO region': 'Africa', 'World Bank income group': 'Lower-middle-income', 'Country': 'Cameroon', 'Sex': 'Female', 'Display Value': 16, 'Numeric': 16.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Life expectancy at birth (years)', 'PUBLISH STATES': 'Published', 'Year': 2000, 'WHO region': 'Africa', 'World Bank income grou

{'Indicator': 'Life expectancy at age 60 (years)', 'PUBLISH STATES': 'Published', 'Year': 2012, 'WHO region': 'Africa', 'World Bank income group': 'Low-income', 'Country': 'Mozambique', 'Sex': 'Female', 'Display Value': 17, 'Numeric': 17.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Life expectancy at birth (years)', 'PUBLISH STATES': 'Published', 'Year': 1990, 'WHO region': 'Africa', 'World Bank income group': 'Low-income', 'Country': 'Malawi', 'Sex': 'Female', 'Display Value': 46, 'Numeric': 46.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Life expectancy at birth (years)', 'PUBLISH STATES': 'Published', 'Year': 2000, 'WHO region': 'Africa', 'World Bank income group': 'Low-income', 'Country': 'Malawi', 'Sex': 'Female', 'Display Value': 44, 'Numeric': 44.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Life expectancy at age 60 (years)', 'PUBLISH STATES': 'Published', 'Year': 2000, 'WHO region': 'Africa', 'World Bank income group': 'Low-income', 'Country'

{'Indicator': 'Healthy life expectancy (HALE) at birth (years)', 'PUBLISH STATES': 'Published', 'Year': 2000, 'WHO region': 'Americas', 'World Bank income group': 'Upper-middle-income', 'Country': 'Jamaica', 'Sex': 'Female', 'Display Value': 64, 'Numeric': 64.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Healthy life expectancy (HALE) at birth (years)', 'PUBLISH STATES': 'Published', 'Year': 2000, 'WHO region': 'Americas', 'World Bank income group': 'Upper-middle-income', 'Country': 'Jamaica', 'Sex': 'Both sexes', 'Display Value': 62, 'Numeric': 62.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Healthy life expectancy (HALE) at birth (years)', 'PUBLISH STATES': 'Published', 'Year': 2012, 'WHO region': 'Eastern Mediterranean', 'World Bank income group': 'Lower-middle-income', 'Country': 'Jordan', 'Sex': 'Female', 'Display Value': 65, 'Numeric': 65.0, 'Low': '', 'High': '', 'Comments': ''}
{'Indicator': 'Healthy life expectancy (HALE) at birth (years)', 'PUBLISH STAT

In [61]:
json_file

[{'Indicator': 'Life expectancy at birth (years)',
  'PUBLISH STATES': 'Published',
  'Year': 1990,
  'WHO region': 'Europe',
  'World Bank income group': 'High-income',
  'Country': 'Andorra',
  'Sex': 'Both sexes',
  'Display Value': 77,
  'Numeric': 77.0,
  'Low': '',
  'High': '',
  'Comments': ''},
 {'Indicator': 'Life expectancy at birth (years)',
  'PUBLISH STATES': 'Published',
  'Year': 2000,
  'WHO region': 'Europe',
  'World Bank income group': 'High-income',
  'Country': 'Andorra',
  'Sex': 'Both sexes',
  'Display Value': 80,
  'Numeric': 80.0,
  'Low': '',
  'High': '',
  'Comments': ''},
 {'Indicator': 'Life expectancy at age 60 (years)',
  'PUBLISH STATES': 'Published',
  'Year': 2012,
  'WHO region': 'Europe',
  'World Bank income group': 'High-income',
  'Country': 'Andorra',
  'Sex': 'Female',
  'Display Value': 28,
  'Numeric': 28.0,
  'Low': '',
  'High': '',
  'Comments': ''},
 {'Indicator': 'Life expectancy at age 60 (years)',
  'PUBLISH STATES': 'Published',
  '

[For more on working with JSON in Python](https://realpython.com/python-json/)
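
Besides `json.load`, which reads from a file object, the `json` module also has `json.loads` and `json.dumps` for working with strings. A short sketch of a round trip, where the `record` dictionary is a made-up subset of the fields shown above:

```python
import json

# Hypothetical record using a few of the keys from the data above.
record = {'Country': 'Andorra', 'Year': 1990, 'Numeric': 77.0, 'Comments': ''}

# dumps turns a Python object into a JSON string (indent is optional).
text = json.dumps(record, indent=2)

# loads parses a JSON string back into a Python object.
restored = json.loads(text)

print(text)
print(restored == record)
```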

In [62]:
import pandas as pd
pd.read_json(path)

Unnamed: 0,Indicator,PUBLISH STATES,Year,WHO region,World Bank income group,Country,Sex,Display Value,Numeric,Low,High,Comments
0,Life expectancy at birth (years),Published,1990,Europe,High-income,Andorra,Both sexes,77,77,,,
1,Life expectancy at birth (years),Published,2000,Europe,High-income,Andorra,Both sexes,80,80,,,
2,Life expectancy at age 60 (years),Published,2012,Europe,High-income,Andorra,Female,28,28,,,
3,Life expectancy at age 60 (years),Published,2000,Europe,High-income,Andorra,Both sexes,23,23,,,
4,Life expectancy at birth (years),Published,2012,Eastern Mediterranean,High-income,United Arab Emirates,Female,78,78,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
4651,Healthy life expectancy (HALE) at birth (years),Published,2012,Western Pacific,Lower-middle-income,Samoa,Female,66,66,,,
4652,Healthy life expectancy (HALE) at birth (years),Published,2012,Eastern Mediterranean,Low-income,Yemen,Both sexes,54,54,,,
4653,Healthy life expectancy (HALE) at birth (years),Published,2000,Africa,Upper-middle-income,South Africa,Male,49,49,,,
4654,Healthy life expectancy (HALE) at birth (years),Published,2000,Africa,Low-income,Zambia,Both sexes,36,36,,,



In [63]:
pd.read_json('separate_jsons.txt', lines=True)

Unnamed: 0,a,b
0,3,5
1,3,5
2,3,5
3,3,5
4,3,5
5,3,5
6,3,5
7,3,5
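
The `lines=True` option tells pandas that each line of the file is its own JSON object. The same idea can be sketched with only the standard library by parsing one line at a time. The sample text below is an assumption made up to mirror the `a` and `b` columns shown above, not the actual contents of separate_jsons.txt.

```python
import json

# Made-up sample in the "one JSON object per line" layout.
json_lines_text = '{"a": 3, "b": 5}\n{"a": 3, "b": 5}\n'

records = []
for line in json_lines_text.splitlines():
    # skip empty lines, then parse each line as its own JSON document
    if line.strip():
        records.append(json.loads(line))

print(records)
```

Calling `json.loads` on the whole text at once would fail here, because two objects in a row are not a single valid JSON document.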


## HTML files

In [64]:
## With pandas you can read tables from some html pages
## In my experience this method is hit or miss, but sometimes it's worth a shot.
world_cups = pd.read_html('https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_finals')

In [68]:
world_cups[3]

Unnamed: 0,Year,Winners,Score[2],Runners-up,Venue,Location,Attendance,Refs
0,1930,Uruguay,4–2,Argentina,Estadio Centenario,"Montevideo, Uruguay",68346,[7][8]
1,1934,Italy,2–1 (a.e.t.),Czechoslovakia,Stadio Nazionale PNF,"Rome, Italy",55000,[9][10]
2,1938,Italy,4–2,Hungary,Stade Olympique de Colombes,"Colombes (Paris), France",45000,[11][12]
3,1950,Uruguay,2–1[n 3],Brazil,Maracanã Stadium,"Rio de Janeiro, Brazil",173850,[13][14]
4,1954,West Germany,3–2,Hungary,Wankdorf Stadium,"Bern, Switzerland",62500,[15][16]
5,1958,Brazil,5–2,Sweden,Råsunda Stadium,"Solna (Stockholm), Sweden",51800,[17][18]
6,1962,Brazil,3–1,Czechoslovakia,Estadio Nacional,"Santiago, Chile",69000,[19][20]
7,1966,England,4–2 (a.e.t.),West Germany,Wembley Stadium,"London, England",96924,[21][22]
8,1970,Brazil,4–1,Italy,Estadio Azteca,"Mexico City, Mexico",107412,[23][24]
9,1974,West Germany,2–1,Netherlands,Olympiastadion,"Munich, West Germany",75200,[25][26]


## Working with Zipped Files

In [69]:
from zipfile import ZipFile

__Download a zip file before continuing__

In [70]:
file_url = 'https://github.com/msaricaumbc/DS_data/blob/master/world_cup.csv.zip?raw=true'
file_name = 'world_cup.zip'

download_file(file_name, file_url)

In [71]:
def unzip(file_name, path='./'):
    # open the zip file in read mode
    # (named zip_file so it does not shadow the built-in zip function)
    with ZipFile(file_name, 'r') as zip_file:
        # print a listing of the zip file's contents
        zip_file.printdir()

        # extract all the files
        print('Extracting all the files now...')
        zip_file.extractall(path=path)
        print('Done!')

In [72]:
unzip(file_name)

File Name                                             Modified             Size
world_cup.csv                                  2021-10-05 22:45:02       164603
__MACOSX/._world_cup.csv                       2021-10-05 22:45:02          176
Extracting all the files now...
Done!
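
The listing shows that the archive also carries a `__MACOSX` metadata entry, which gets extracted along with the data. If we only need one file, we can read it straight from the archive without extracting anything. The sketch below builds a tiny archive in memory so it runs on its own; the member names and the csv content are assumptions for illustration, not the real world_cup.zip.

```python
import io
from zipfile import ZipFile

# Build a small zip archive in memory (hypothetical contents).
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('demo.csv', 'Year,Winner\n1930,Uruguay\n')
    archive.writestr('__MACOSX/._demo.csv', 'metadata')

with ZipFile(buffer, 'r') as archive:
    # namelist() returns the member names; skip the macOS metadata entries
    members = [name for name in archive.namelist()
               if not name.startswith('__MACOSX/')]
    # open() gives a binary file-like object for a single member
    with archive.open(members[0]) as member:
        text = member.read().decode('utf-8')

print(members)
print(text)
```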


In [73]:
!ls

__MACOSX                            some_file.txt
data-text.csv                       some_other_file.txt
data-text.json                      week6 load&wrangling data.ipynb
homework                            week6-loading&wrangling data.pdf
labs                                week6-loading&wrangling data.pptx
my_csv_file.csv                     world_cup.csv
names.csv                           world_cup.zip
separate_jsons.txt                  ~$week6-loading&wrangling data.pptx


In [74]:
pd.read_csv('world_cup.csv')

Unnamed: 0,Year,Datetime,Stage,Stadium,City,Home Team Name,Home Team Goals,Away Team Goals,Away Team Name,Win conditions,Attendance,Half-time Home Goals,Half-time Away Goals,Referee,Assistant 1,Assistant 2,RoundID,MatchID,Home Team Initials,Away Team Initials
0,1930,13 Jul 1930 - 15:00,Group 1,Pocitos,Montevideo,France,4,1,Mexico,,4444.0,3,0,LOMBARDI Domingo (URU),CRISTOPHE Henry (BEL),REGO Gilberto (BRA),201,1096,FRA,MEX
1,1930,13 Jul 1930 - 15:00,Group 4,Parque Central,Montevideo,USA,3,0,Belgium,,18346.0,2,0,MACIAS Jose (ARG),MATEUCCI Francisco (URU),WARNKEN Alberto (CHI),201,1090,USA,BEL
2,1930,14 Jul 1930 - 12:45,Group 2,Parque Central,Montevideo,Yugoslavia,2,1,Brazil,,24059.0,2,0,TEJADA Anibal (URU),VALLARINO Ricardo (URU),BALWAY Thomas (FRA),201,1093,YUG,BRA
3,1930,14 Jul 1930 - 14:50,Group 3,Pocitos,Montevideo,Romania,3,1,Peru,,2549.0,1,0,WARNKEN Alberto (CHI),LANGENUS Jean (BEL),MATEUCCI Francisco (URU),201,1098,ROU,PER
4,1930,15 Jul 1930 - 16:00,Group 1,Parque Central,Montevideo,Argentina,1,0,France,,23409.0,0,0,REGO Gilberto (BRA),SAUCEDO Ulises (BOL),RADULESCU Constantin (ROU),201,1085,ARG,FRA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
847,2014,05 Jul 2014 - 17:00,Quarter-finals,Arena Fonte Nova,Salvador,Netherlands,0,0,Costa Rica,Netherlands win on penalties (4 - 3),51179.0,0,0,Ravshan IRMATOV (UZB),RASULOV Abduxamidullo (UZB),KOCHKAROV Bakhadyr (KGZ),255953,300186488,NED,CRC
848,2014,08 Jul 2014 - 17:00,Semi-finals,Estadio Mineirao,Belo Horizonte,Brazil,1,7,Germany,,58141.0,0,5,RODRIGUEZ Marco (MEX),TORRENTERA Marvin (MEX),QUINTERO Marcos (MEX),255955,300186474,BRA,GER
849,2014,09 Jul 2014 - 17:00,Semi-finals,Arena de Sao Paulo,Sao Paulo,Netherlands,0,0,Argentina,Argentina win on penalties (2 - 4),63267.0,0,0,C�neyt �AKIR (TUR),DURAN Bahattin (TUR),ONGUN Tarik (TUR),255955,300186490,NED,ARG
850,2014,12 Jul 2014 - 17:00,Play-off for third place,Estadio Nacional,Brasilia,Brazil,0,3,Netherlands,,68034.0,0,2,HAIMOUDI Djamel (ALG),ACHIK Redouane (MAR),ETCHIALI Abdelhak (ALG),255957,300186502,BRA,NED


For more on extracting and writing zip files: https://www.geeksforgeeks.org/working-zip-files-python/

## Encoding

In [75]:
download_file('PoliceShootingsUS.csv','https://raw.githubusercontent.com/msaricaumbc/DS_data/master/PoliceShootingsUS.csv')

In [76]:
!head PoliceShootingsUS.csv

id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera3,Tim Elliot,02/01/15,shot,gun,53,M,A,Shelton,WA,TRUE,attack,Not fleeing,FALSE4,Lewis Lee Lembke,02/01/15,shot,gun,47,M,W,Aloha,OR,FALSE,attack,Not fleeing,FALSE5,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23,M,H,Wichita,KS,FALSE,other,Not fleeing,FALSE8,Matthew Hoffman,04/01/15,shot,toy weapon,32,M,W,San Francisco,CA,TRUE,attack,Not fleeing,FALSE9,Michael Rodriguez,04/01/15,shot,nail gun,39,M,H,Evans,CO,FALSE,attack,Not fleeing,FALSE11,Kenneth Joe Brown,04/01/15,shot,gun,18,M,W,Guthrie,OK,FALSE,attack,Not fleeing,FALSE13,Kenneth Arnold Buck,05/01/15,shot,gun,22,M,H,Chandler,AZ,FALSE,attack,Car,FALSE15,Brock Nichols,06/01/15,shot,gun,35,M,W,Assaria,KS,FALSE,attack,Not fleeing,FALSE16,Autumn Steele,06/01/15,shot,unarmed,34,F,W,Burlington,IA,FALSE,other,Not fleeing,TRUE17,Leslie Sapp III,06/01/15,shot,toy weapon,47,M,B,Knoxville,PA,FALSE,attack,Not fleein

ephen Steele,13/07/17,shot,knife,56,M,,Crystal Springs,FL,TRUE,other,Not fleeing,FALSE2776,Pedro Rubio,13/07/17,shot,knife,42,M,,Goodyear,AZ,FALSE,attack,Not fleeing,FALSE2777,Ramiro Bravo Ramirez,14/07/17,shot,gun,34,M,,Greenville,SC,FALSE,attack,Not fleeing,TRUE2778,Gerber Dieguez,15/07/17,shot,undetermined,29,M,,Los Angeles,CA,FALSE,undetermined,Foot,FALSE2779,Justine Damond,15/07/17,shot,unarmed,40,F,W,Minneapolis,MN,FALSE,undetermined,Not fleeing,FALSE2780,TK TK,15/07/17,shot,gun,23,M,,Moreno Valley,CA,TRUE,other,Not fleeing,FALSE2781,TK TK,15/07/17,shot,gun,,M,,Arlington,TX,FALSE,attack,Foot,FALSE2782,India N. Nelson,17/07/17,shot,gun,25,F,,Norfolk,VA,FALSE,attack,Not fleeing,FALSE2783,Ernesto Sedillo,17/07/17,shot,gun,52,M,,Las Cruces,NM,FALSE,attack,Not fleeing,FALSE2784,Eric Wesley Clark,17/07/17,shot,gun,43,M,,Culpeper,VA,FALSE,attack,Car,FALSE2785,Jose Cazares,17/07/17,shot,gun,37,M,H,San Antonio,TX,FALSE,attack,Not fleeing,FALSE2786,Daniel Thomas Reid,18/07/17,sh

In [78]:
pd.read_csv('PoliceShootingsUS.csv')

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
0,3,Tim Elliot,02/01/15,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False
1,4,Lewis Lee Lembke,02/01/15,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False
2,5,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False
3,8,Matthew Hoffman,04/01/15,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False
4,9,Michael Rodriguez,04/01/15,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2530,2822,Rodney E. Jacobs,28/07/17,shot,gun,31.0,M,,Kansas City,MO,False,attack,Not fleeing,False
2531,2813,TK TK,28/07/17,shot,vehicle,,M,,Albuquerque,NM,False,attack,Car,False
2532,2818,Dennis W. Robinson,29/07/17,shot,gun,48.0,M,,Melba,ID,False,attack,Car,False
2533,2817,Isaiah Tucker,31/07/17,shot,vehicle,28.0,M,B,Oshkosh,WI,False,attack,Car,True


In [80]:
!pip install chardet
import chardet



In [81]:

with open('PoliceShootingsUS.csv', 'rb') as f:
    content = f.read()
#     print(content)
    result = chardet.detect(content)
    print(result)

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}


In [82]:
chardet.detect(open('PoliceShootingsUS.csv', 'rb').read())

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

In [83]:
pd.read_csv('PoliceShootingsUS.csv', encoding='Windows-1252')

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
0,3,Tim Elliot,02/01/15,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False
1,4,Lewis Lee Lembke,02/01/15,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False
2,5,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False
3,8,Matthew Hoffman,04/01/15,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False
4,9,Michael Rodriguez,04/01/15,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2530,2822,Rodney E. Jacobs,28/07/17,shot,gun,31.0,M,,Kansas City,MO,False,attack,Not fleeing,False
2531,2813,TK TK,28/07/17,shot,vehicle,,M,,Albuquerque,NM,False,attack,Car,False
2532,2818,Dennis W. Robinson,29/07/17,shot,gun,48.0,M,,Melba,ID,False,attack,Car,False
2533,2817,Isaiah Tucker,31/07/17,shot,vehicle,28.0,M,B,Oshkosh,WI,False,attack,Car,True
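
To see why the encoding matters, here is a small sketch that is independent of the csv file above. It encodes a made-up string with Windows-1252 (which Python also calls `cp1252`) and then tries to decode the bytes as UTF-8:

```python
# Encode a string that contains a non-ASCII character with Windows-1252.
raw_bytes = 'café'.encode('cp1252')

# The same bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    raw_bytes.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding that was used for writing recovers the text.
decoded = raw_bytes.decode('cp1252')

print(raw_bytes, utf8_ok, decoded)
```

This is the situation a detector like `chardet` tries to help with: it guesses the encoding from the bytes, and the `confidence` value in its result is a reminder that the guess can be wrong.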


## Lab

In [84]:
# clean up after class
!rm *.txt
!rm *.csv
!rm *.json
!rm *.zip
!rm -r __MACOSX