Loading the file into a DataFrame and surveying the basic characteristics of the data

In [1]:
import pandas as pd 
new_census = pd.read_csv('ChicagoCensusData.csv') 
print(new_census.info())
print(new_census.head(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 78 entries, 0 to 77
Data columns (total 9 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   COMMUNITY_AREA_NUMBER                         77 non-null     float64
 1   COMMUNITY_AREA_NAME                           78 non-null     object 
 2   PERCENT_OF_HOUSING_CROWDED                    78 non-null     float64
 3   PERCENT_HOUSEHOLDS_BELOW_POVERTY              78 non-null     float64
 4   PERCENT_AGED_16__UNEMPLOYED                   78 non-null     float64
 5   PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA  78 non-null     float64
 6   PERCENT_AGED_UNDER_18_OR_OVER_64              78 non-null     float64
 7   PER_CAPITA_INCOME                             78 non-null     int64  
 8   HARDSHIP_INDEX                                77 non-null     float64
dtypes: float64(7), int64(1), object(1)
memory usage: 5.6+ KB
None
   COM

Preparing the file “census.csv”

Removing rows with missing values, checking for duplicates, and saving to a new .csv file

In [2]:
import pandas as pd 

# Check whether any row is an exact copy of an earlier row.
has_duplicates = new_census.duplicated().any()
if has_duplicates:
    print("The data contains duplicates.")
else:
    print("The data contains no duplicates.")

The data contains no duplicates.


In [3]:
import pandas as pd 

# Drop every row that has at least one missing value.
new_census_b = new_census.dropna()
print("Rows with missing values were removed.")

# Save the cleaned table without the pandas index column.
new_census_b.to_csv("census.csv", index=False)
print("The new file census.csv was saved.")

Rows with missing values were removed.
The new file census.csv was saved.
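
Before dropping rows, it can help to see how many values are missing and how many rows the cleanup actually removes. The sketch below uses a small hypothetical table (the area names and numbers are made up for illustration, not taken from ChicagoCensusData.csv) to count missing values per column and compare the row count before and after `dropna()`.

```python
import pandas as pd

# Hypothetical miniature table; the values are illustrative only.
sample = pd.DataFrame({
    "COMMUNITY_AREA_NUMBER": [1.0, 2.0, None],
    "COMMUNITY_AREA_NAME": ["Area A", "Area B", "Area C"],
    "HARDSHIP_INDEX": [39.0, 46.0, None],
})

# Count missing values per column before dropping anything.
missing_per_column = sample.isna().sum()

# Drop incomplete rows and record how many rows were removed.
cleaned = sample.dropna()
rows_removed = len(sample) - len(cleaned)

print(missing_per_column)
print(f"Rows removed: {rows_removed}")
```

Reporting the removed row count makes the effect of the cleanup visible instead of relying on a fixed confirmation message.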


Preparing the file “crime.csv”

Converting the "DATE" column to datetime, checking for duplicates, removing rows with missing values, and saving to a new .csv file

In [4]:
import pandas as pd 
new_crime = pd.read_csv('ChicagoCrimeData.csv') 
print(new_crime.info())
print(new_crime.head(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 533 entries, 0 to 532
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     533 non-null    int64  
 1   CASE_NUMBER            533 non-null    object 
 2   DATE                   533 non-null    object 
 3   BLOCK                  533 non-null    object 
 4   IUCR                   533 non-null    object 
 5   PRIMARY_TYPE           533 non-null    object 
 6   DESCRIPTION            533 non-null    object 
 7   LOCATION_DESCRIPTION   533 non-null    object 
 8   ARREST                 533 non-null    bool   
 9   DOMESTIC               533 non-null    bool   
 10  BEAT                   533 non-null    int64  
 11  DISTRICT               533 non-null    int64  
 12  WARD                   490 non-null    float64
 13  COMMUNITY_AREA_NUMBER  490 non-null    float64
 14  FBICODE                533 non-null    object 
 15  X_COOR

In [5]:
import pandas as pd
# Convert the DATE column from text to datetime.
new_crime['DATE'] = pd.to_datetime(new_crime['DATE'])

print("The data type of 'DATE' was updated.")

# Check whether any row is an exact copy of an earlier row.
has_duplicates = new_crime.duplicated().any()
if has_duplicates:
    print("The data contains duplicates.")
else:
    print("The data contains no duplicates.")

The data type of 'DATE' was updated.
The data contains no duplicates.
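
The conversion above relies on pandas inferring the date format. A more defensive variant is sketched below: it passes an explicit `format` and uses `errors="coerce"` so that unparseable values become `NaT` and can be counted. The sample strings and the `%Y-%m-%d` format are assumptions for illustration; the actual format in ChicagoCrimeData.csv may differ.

```python
import pandas as pd

# Hypothetical date strings; the real file may use another format.
raw = pd.Series(["2004-08-28", "2011-04-04", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Count the values that could not be parsed.
unparsed_count = int(parsed.isna().sum())

print(parsed.dtype)
print(f"Unparseable values: {unparsed_count}")
```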


In [6]:
import pandas as pd

# Drop every row that has at least one missing value.
new_crime_b = new_crime.dropna()
print("Rows with missing values were removed.")

# Save the cleaned table without the pandas index column.
new_crime_b.to_csv("crime.csv", index=False)

print("The new file crime.csv was saved.")

Rows with missing values were removed.
The new file crime.csv was saved.
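
Calling `dropna()` without arguments removes a row if any column is empty. If only some columns matter for later analysis, `dropna(subset=...)` keeps more rows. The sketch below compares both on a hypothetical table; the `NOTE` column and all values are invented for illustration.

```python
import pandas as pd

# Hypothetical rows; NOTE is an invented column used only for this example.
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "COMMUNITY_AREA_NUMBER": [58.0, 23.0, None],
    "NOTE": ["a", None, "c"],
})

# Drop rows with a missing value in any column.
strict = sample.dropna()

# Drop rows only when the column needed later is missing.
targeted = sample.dropna(subset=["COMMUNITY_AREA_NUMBER"])

print(f"Rows kept by dropna(): {len(strict)}")
print(f"Rows kept by dropna(subset=...): {len(targeted)}")
```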


Preparing the file “schools.csv”

Selecting columns, checking for duplicates, and saving them to a new .csv file.

In [7]:
import pandas as pd 
df3 = pd.read_csv('ChicagoPublicSchools.csv') 
print(df3.info())
print(df3.head(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 566 entries, 0 to 565
Data columns (total 78 columns):
 #   Column                                            Non-Null Count  Dtype  
---  ------                                            --------------  -----  
 0   School_ID                                         566 non-null    int64  
 1   NAME_OF_SCHOOL                                    566 non-null    object 
 2   Elementary, Middle, or High School                566 non-null    object 
 3   Street_Address                                    566 non-null    object 
 4   City                                              566 non-null    object 
 5   State                                             566 non-null    object 
 6   ZIP_Code                                          566 non-null    int64  
 7   Phone_Number                                      566 non-null    object 
 8   Link                                              565 non-null    object 
 9   Network_Manager      

In [8]:
schools_selected = ['School_ID', 'NAME_OF_SCHOOL', 'HEALTHY_SCHOOL_CERTIFIED','COMMUNITY_AREA_NUMBER','COMMUNITY_AREA_NAME']
new_schools = df3[schools_selected]

print("The columns from ChicagoPublicSchools were selected and stored in new_schools.")

# Check whether any row is an exact copy of an earlier row.
has_duplicates = new_schools.duplicated().any()
if has_duplicates:
    print("The data contains duplicates.")
else:
    print("The data contains no duplicates.")

The columns from ChicagoPublicSchools were selected and stored in new_schools.
The data contains no duplicates.
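
`duplicated()` without arguments only flags rows that match in every column. When a table has an identifier column, a check restricted to that column can catch repeated keys whose other fields differ. A minimal sketch with hypothetical IDs and names:

```python
import pandas as pd

# Hypothetical rows: the second and third share an ID but differ in name.
sample = pd.DataFrame({
    "School_ID": [1, 2, 2],
    "NAME_OF_SCHOOL": ["School A", "School B", "School B (renamed)"],
})

# Full-row check: no two rows are identical in every column.
full_row_duplicates = bool(sample.duplicated().any())

# Key check: the same School_ID appears more than once.
key_duplicates = bool(sample.duplicated(subset=["School_ID"]).any())

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"Duplicate School_ID values: {key_duplicates}")
```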


In [9]:
new_schools.to_csv("schools.csv", index = False)
print("The new file schools.csv was saved.")

The new file schools.csv was saved.


Creating the SQLite database and loading the files

In [10]:
import sqlite3
import pandas as pd

conn = sqlite3.connect('chicago_database.db')
data_a = pd.read_csv('census.csv')
data_b = pd.read_csv('crime.csv')
data_c = pd.read_csv('schools.csv')

data_a.to_sql("census", conn, if_exists='replace', index=False)
data_b.to_sql("crime", conn, if_exists='replace', index=False)
data_c.to_sql("schools", conn, if_exists='replace', index=False)

print("The data was loaded into the database.")

The data was loaded into the database.
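
One way to confirm that a load succeeded is to compare the row count in the database with the row count of the DataFrame. The sketch below does this against an in-memory SQLite database with a hypothetical three-row table, so it does not touch chicago_database.db.

```python
import sqlite3
import pandas as pd

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")

# Hypothetical data standing in for one of the cleaned CSV files.
sample = pd.DataFrame({
    "COMMUNITY_AREA_NUMBER": [1, 2, 3],
    "HARDSHIP_INDEX": [39, 46, 20],
})
sample.to_sql("census", conn, if_exists="replace", index=False)

# Compare the number of rows in the table with the DataFrame length.
row_count = conn.execute("SELECT COUNT(*) FROM census").fetchone()[0]
print(f"DataFrame rows: {len(sample)}, table rows: {row_count}")

conn.close()
```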


Deleting crime records (table “crime”) that occurred after the start of the 2011/2012 school year

In [13]:
%load_ext sql
import prettytable
prettytable.DEFAULT = 'DEFAULT'

%sql sqlite:///chicago_database.db
%sql DELETE FROM crime WHERE DATE > '2011-08-31'

 * sqlite:///chicago_database.db
166 rows affected.


[]
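
The `DELETE` above compares `DATE` with a string. This works when the column holds text in `YYYY-MM-DD` form, because such strings sort in the same order as the dates they represent. If the stored values also carried a time component, a value such as `2011-08-31 10:00:00` would compare as greater than `'2011-08-31'` and be deleted as well. The sketch below shows the comparison on an in-memory database with hypothetical rows and a parameterized cutoff.

```python
import sqlite3

# In-memory database with hypothetical rows; dates are stored as ISO text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crime (ID INTEGER, DATE TEXT)")
conn.executemany(
    "INSERT INTO crime VALUES (?, ?)",
    [(1, "2004-08-28"), (2, "2011-08-31"), (3, "2011-09-01"), (4, "2012-01-15")],
)

# Text comparison on ISO dates matches chronological order.
cutoff = "2011-08-31"
cursor = conn.execute("DELETE FROM crime WHERE DATE > ?", (cutoff,))
deleted = cursor.rowcount

# Collect the IDs of the rows that survived the delete.
remaining = [row[0] for row in conn.execute("SELECT ID FROM crime ORDER BY ID")]
print(f"Deleted: {deleted}, remaining IDs: {remaining}")

conn.close()
```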

Creating previews for quick orientation in the data

In [14]:
%load_ext sql
import prettytable
prettytable.DEFAULT = 'DEFAULT'

%sql sqlite:///chicago_database.db
%sql SELECT * FROM census LIMIT 5

The sql extension is already loaded. To reload it, use:
  %reload_ext sql
 * sqlite:///chicago_database.db
Done.


COMMUNITY_AREA_NUMBER,COMMUNITY_AREA_NAME,PERCENT_OF_HOUSING_CROWDED,PERCENT_HOUSEHOLDS_BELOW_POVERTY,PERCENT_AGED_16__UNEMPLOYED,PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA,PERCENT_AGED_UNDER_18_OR_OVER_64,PER_CAPITA_INCOME,HARDSHIP_INDEX
1.0,Rogers Park,7.7,23.6,8.7,18.2,27.5,23939,39.0
2.0,West Ridge,7.8,17.2,8.8,20.8,38.5,23040,46.0
3.0,Uptown,3.8,24.0,8.9,11.8,22.2,35787,20.0
4.0,Lincoln Square,3.4,10.9,8.2,13.4,25.5,37524,17.0
5.0,North Center,0.3,7.5,5.2,4.5,26.2,57123,6.0


In [15]:
%load_ext sql
import prettytable
prettytable.DEFAULT = 'DEFAULT'

%sql sqlite:///chicago_database.db
%sql SELECT * FROM crime LIMIT 5

The sql extension is already loaded. To reload it, use:
  %reload_ext sql
 * sqlite:///chicago_database.db
Done.


ID,CASE_NUMBER,DATE,BLOCK,IUCR,PRIMARY_TYPE,DESCRIPTION,LOCATION_DESCRIPTION,ARREST,DOMESTIC,BEAT,DISTRICT,WARD,COMMUNITY_AREA_NUMBER,FBICODE,X_COORDINATE,Y_COORDINATE,YEAR,LATITUDE,LONGITUDE,LOCATION
3512276,HK587712,2004-08-28,047XX S KEDZIE AVE,890,THEFT,FROM BUILDING,SMALL RETAIL STORE,0,0,911,9,14.0,58.0,6,1155838.0,1873050.0,2004,41.8074405,-87.70395585,"(41.8074405, -87.703955849)"
3406613,HK456306,2004-06-26,009XX N CENTRAL PARK AVE,820,THEFT,$500 AND UNDER,OTHER,0,0,1112,11,27.0,23.0,6,1152206.0,1906127.0,2004,41.89827996,-87.71640551,"(41.898279962, -87.716405505)"
8002131,HT233595,2011-04-04,043XX S WABASH AVE,820,THEFT,$500 AND UNDER,NURSING HOME/RETIREMENT HOME,0,0,221,2,3.0,38.0,6,1177436.0,1876313.0,2011,41.81593313,-87.62464213,"(41.815933131, -87.624642127)"
7903289,HT133522,2010-12-30,083XX S KINGSTON AVE,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,0,0,423,4,7.0,46.0,6,1194622.0,1850125.0,2010,41.74366532,-87.56246276,"(41.743665322, -87.562462756)"
7732712,HS540106,2010-09-29,006XX W CHICAGO AVE,810,THEFT,OVER $500,PARKING LOT/GARAGE(NON.RESID.),0,0,1323,12,27.0,24.0,6,1171668.0,1905607.0,2010,41.89644677,-87.64493868,"(41.896446772, -87.644938678)"


In [16]:
%load_ext sql
import prettytable
prettytable.DEFAULT = 'DEFAULT'

%sql sqlite:///chicago_database.db
%sql SELECT * FROM schools LIMIT 5

The sql extension is already loaded. To reload it, use:
  %reload_ext sql
 * sqlite:///chicago_database.db
Done.


School_ID,NAME_OF_SCHOOL,HEALTHY_SCHOOL_CERTIFIED,COMMUNITY_AREA_NUMBER,COMMUNITY_AREA_NAME
610038,Abraham Lincoln Elementary School,Yes,7,LINCOLN PARK
610281,Adam Clayton Powell Paideia Community Academy Elementary School,No,43,SOUTH SHORE
610185,Adlai E Stevenson Elementary School,No,70,ASHBURN
609993,Agustin Lara Elementary Academy,No,61,NEW CITY
610513,Air Force Academy High School,Yes,34,ARMOUR SQUARE
