# WEB PROJECT: DATA ANALYSIS WITH PANDAS

From the UfoScrubbed database of UFO sightings, we'll explore whether sightings differ significantly among countries and among years.

Later on, we'll study income group and the dates of the latest censuses as an indicator of how well informed a government is about its society. We assume the aphorism "people have the leaders they deserve", i.e. an informed government is a sign of an informed people. Sightings might have some correlation with level of education.

Task for the future: combine the different census columns (as number of years since the last census) into an average, to measure how much a country is an example of an 'information society'.

In [564]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt

## Data Wrangling

#### UFO DB:

In [565]:
ufosight = pd.read_csv("./DB/UfoScrubbed.csv")
# Use the scrubbed file: the other, 'complete' database gives some read errors.
countries = pd.read_csv("./DB/Country.csv")
ufosight.head()



Unnamed: 0,datetime,city,state,country,shape,duration (seconds),duration (hours/min),comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
2,10/10/1955 17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667
3,10/10/1956 21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833
4,10/10/1960 20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611


In [566]:
list(ufosight)

['datetime',
 'city',
 'state',
 'country',
 'shape',
 'duration (seconds)',
 'duration (hours/min)',
 'comments',
 'date posted',
 'latitude',
 'longitude ']
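
The column list above shows a trailing space in `'longitude '`. As a minimal sketch on a toy frame (the `sample` data is illustrative, not the real file), all column names can be stripped of surrounding whitespace in one pass instead of renaming them one by one:

```python
import pandas as pd

# Toy frame reproducing the header problem: note the trailing space in 'longitude '.
sample = pd.DataFrame({'latitude': [29.88], 'longitude ': [-97.94]})

# Strip leading and trailing whitespace from every column name.
sample.columns = sample.columns.str.strip()
cols = list(sample.columns)
print(cols)
```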

**Renaming columns** that are awkward to type and read:

In [567]:
ufosight = ufosight.rename(columns={
    'duration (seconds)': 'seconds',
    'duration (hours/min)': 'communicDurat',
    'longitude ': 'longitude',  # the original header has a trailing space
})
# communicDurat is the same duration as 'seconds', but written the way people describe it
ufosight.head(2)

# Future improvements: check that communicDurat and seconds agree; there seem to be a few errors, though they are rare.
#         * Convert the duration to an integer number of minutes, e.g.:
#           ufosight = ufosight['duration'].replace('.*hrs', '')

Unnamed: 0,datetime,city,state,country,shape,seconds,communicDurat,comments,date posted,latitude,longitude
0,10/10/1949 20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111
1,10/10/1949 21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082
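
The consistency check mentioned as a future improvement could start along these lines. This is only a sketch: `parse_simple_duration` is a hypothetical helper that understands plain "<number> <unit>" texts and gives up on everything else (ranges such as `1-2 hrs`, fractions such as `1/2 hour`), and the sample rows are copied from the output shown earlier.

```python
import re
import numpy as np
import pandas as pd

# Hypothetical helper data: seconds per unit prefix.
UNIT_SECONDS = {'sec': 1, 'min': 60, 'hr': 3600, 'hour': 3600}
SIMPLE = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*(sec|min|hr|hour)', re.IGNORECASE)

def parse_simple_duration(text):
    """Return the duration in seconds, or NaN if the text is not '<number> <unit>'."""
    match = SIMPLE.match(str(text))
    if match is None:
        return np.nan
    value, unit = match.groups()
    return float(value) * UNIT_SECONDS[unit.lower()]

sample = pd.DataFrame({
    'seconds': ['2700', '7200', '20', '20', '900'],
    'communicDurat': ['45 minutes', '1-2 hrs', '20 seconds', '1/2 hour', '15 minutes'],
})
# Assumption: 'seconds' may have been read as text, so convert it first.
recorded = pd.to_numeric(sample['seconds'], errors='coerce')
parsed = sample['communicDurat'].map(parse_simple_duration)

unparsed = parsed.isna()                           # texts the helper could not read
mismatch = parsed.notna() & (parsed != recorded)   # readable texts that disagree
print(sample[unparsed | mismatch])
```

Rows the helper cannot read are reported separately from real disagreements, so a row like `1/2 hour` recorded as 20 seconds still needs a manual look or a richer parser.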


Split 'datetime' into year, date and time:

In [568]:
ufosight['year'] = ufosight['datetime']
ufosight['m/d'] = ufosight['datetime']
ufosight = ufosight.rename(columns={'datetime': 'time'})

# Remove everything that is not the year from the year column (raw strings keep the regex escapes intact):
ufosight['year'] = ufosight['year'].replace(r'\s\d\d:\d\d', '', regex=True).replace(r'\d?\d/\d?\d/', '', regex=True)
# Remove everything that is not month/day from the month/day column:
ufosight['m/d'] = ufosight['m/d'].replace(r'\s\d\d:\d\d', '', regex=True).replace(r'/\d{4}', '', regex=True)
# Remove everything but the time from the time column:
ufosight['time'] = ufosight['time'].replace(r'\d?\d/\d?\d/\d{4}\s', '', regex=True)
ufosight.head()

Unnamed: 0,time,city,state,country,shape,seconds,communicDurat,comments,date posted,latitude,longitude,year,m/d
0,20:30,san marcos,tx,us,cylinder,2700,45 minutes,This event took place in early fall around 194...,4/27/2004,29.8830556,-97.941111,1949,10/10
1,21:00,lackland afb,tx,,light,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...,12/16/2005,29.38421,-98.581082,1949,10/10
2,17:00,chester (uk/england),,gb,circle,20,20 seconds,Green/Orange circular disc over Chester&#44 En...,1/21/2008,53.2,-2.916667,1955,10/10
3,21:00,edna,tx,us,circle,20,1/2 hour,My older brother and twin sister were leaving ...,1/17/2004,28.9783333,-96.645833,1956,10/10
4,20:00,kaneohe,hi,us,light,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...,1/22/2004,21.4180556,-157.803611,1960,10/10
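
An alternative to the regex approach is to parse the column with `pd.to_datetime` and read the parts from the result. The sketch below runs on toy strings in the same `month/day/year hour:minute` layout; the `24:00` row is an assumed example of a value that this format rejects, which `errors='coerce'` turns into a missing value instead of an exception. Note that `strftime` zero-pads, so `1/5` becomes `01/05`, unlike the regex version.

```python
import pandas as pd

sample = pd.DataFrame({
    'datetime': ['10/10/1949 20:30', '1/5/2004 09:15', '10/11/2006 24:00'],
})

# Parse once; unparseable rows (here the assumed '24:00' case) become NaT.
parsed = pd.to_datetime(sample['datetime'], format='%m/%d/%Y %H:%M', errors='coerce')

sample['year'] = parsed.dt.year
sample['m/d'] = parsed.dt.strftime('%m/%d')
sample['time'] = parsed.dt.strftime('%H:%M')
print(sample)
```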


Reorder columns:

In [569]:
ufosight = ufosight[['country', 'state', 'city', 'latitude', 'longitude', 'shape', 'year', 'm/d', 'time', 'date posted', 'seconds', 'communicDurat', 'comments']]
ufosight.head()

Unnamed: 0,country,state,city,latitude,longitude,shape,year,m/d,time,date posted,seconds,communicDurat,comments
0,us,tx,san marcos,29.8830556,-97.941111,cylinder,1949,10/10,20:30,4/27/2004,2700,45 minutes,This event took place in early fall around 194...
1,,tx,lackland afb,29.38421,-98.581082,light,1949,10/10,21:00,12/16/2005,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...
2,gb,,chester (uk/england),53.2,-2.916667,circle,1955,10/10,17:00,1/21/2008,20,20 seconds,Green/Orange circular disc over Chester&#44 En...
3,us,tx,edna,28.9783333,-96.645833,circle,1956,10/10,21:00,1/17/2004,20,1/2 hour,My older brother and twin sister were leaving ...
4,us,hi,kaneohe,21.4180556,-157.803611,light,1960,10/10,20:00,1/22/2004,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...


In [570]:
set(ufosight.country)

{'au', 'ca', 'de', 'gb', nan, 'us'}

In [571]:
ufosight[ufosight.country == 'de'].head(2)
# Check what 'de' stands for: it is Germany.

Unnamed: 0,country,state,city,latitude,longitude,shape,year,m/d,time,date posted,seconds,communicDurat,comments
1332,de,,berlin (germany),52.516667,13.4,fireball,2006,10/13,00:02,10/30/2006,120,1-2 minutes,7 shooting lights&#44 followed by a formation&...
3353,de,,berlin (germany),52.516667,13.4,unknown,2012,10/20,18:00,10/30/2012,1500,25 minutes,Ovni a berlin. Sorte de tissu noir&#44 flottan...


In [572]:
ufosight.shape

(80332, 13)

Eliminate rows whose country is NaN:

In [573]:
ufosight['country'].replace('NaN', '')
# This does not remove the NaN values: replace() searches for the string 'NaN', and its result is not assigned.
ufosight.head()
# ufosight = ufosight.dropna(subset=['country'])

Unnamed: 0,country,state,city,latitude,longitude,shape,year,m/d,time,date posted,seconds,communicDurat,comments
0,us,tx,san marcos,29.8830556,-97.941111,cylinder,1949,10/10,20:30,4/27/2004,2700,45 minutes,This event took place in early fall around 194...
1,,tx,lackland afb,29.38421,-98.581082,light,1949,10/10,21:00,12/16/2005,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...
2,gb,,chester (uk/england),53.2,-2.916667,circle,1955,10/10,17:00,1/21/2008,20,20 seconds,Green/Orange circular disc over Chester&#44 En...
3,us,tx,edna,28.9783333,-96.645833,circle,1956,10/10,21:00,1/17/2004,20,1/2 hour,My older brother and twin sister were leaving ...
4,us,hi,kaneohe,21.4180556,-157.803611,light,1960,10/10,20:00,1/22/2004,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...


In [574]:
ufosight.shape

(80332, 13)
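
The shape is unchanged because nothing was removed. A small sketch on toy data (the `sample` frame is illustrative) shows the two reasons: a missing value is not the string `'NaN'`, and pandas methods return a new object rather than modifying the frame in place.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'country': ['us', np.nan, 'gb'],
    'city': ['san marcos', 'lackland afb', 'chester (uk/england)'],
})

# replace() looks for the literal string 'NaN'; the missing value does not match it.
replaced = sample['country'].replace('NaN', '')
still_missing = int(replaced.isna().sum())

# dropna() returns a new frame; assign the result to keep it.
cleaned = sample.dropna(subset=['country'])
print(still_missing, sample.shape, cleaned.shape)
```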

In [575]:
set(ufosight.country)

{'au', 'ca', 'de', 'gb', nan, 'us'}

In [576]:
# ufosight = ufosight.dropna(subset=['country'])
ufosight.head()
# To do later: keep rows that have a recognizable US state, replacing their NaN country with 'us' instead of dropping them.

Unnamed: 0,country,state,city,latitude,longitude,shape,year,m/d,time,date posted,seconds,communicDurat,comments
0,us,tx,san marcos,29.8830556,-97.941111,cylinder,1949,10/10,20:30,4/27/2004,2700,45 minutes,This event took place in early fall around 194...
1,,tx,lackland afb,29.38421,-98.581082,light,1949,10/10,21:00,12/16/2005,7200,1-2 hrs,1949 Lackland AFB&#44 TX. Lights racing acros...
2,gb,,chester (uk/england),53.2,-2.916667,circle,1955,10/10,17:00,1/21/2008,20,20 seconds,Green/Orange circular disc over Chester&#44 En...
3,us,tx,edna,28.9783333,-96.645833,circle,1956,10/10,21:00,1/17/2004,20,1/2 hour,My older brother and twin sister were leaving ...
4,us,hi,kaneohe,21.4180556,-157.803611,light,1960,10/10,20:00,1/22/2004,900,15 minutes,AS a Marine 1st Lt. flying an FJ4B fighter/att...
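
The "do later" idea could be sketched as follows, assuming the `state` column holds lowercase two-letter codes as in the rows shown. `US_STATES` is a hand-written set introduced for this illustration (50 states plus `dc`); it is not part of the dataset, and codes shared with other countries' regions would need extra care.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup set: lowercase US state abbreviations plus 'dc'.
US_STATES = set(
    "al ak az ar ca co ct de fl ga hi id il in ia ks ky la me md "
    "ma mi mn ms mo mt ne nv nh nj nm ny nc nd oh ok or pa ri sc "
    "sd tn tx ut vt va wa wv wi wy dc".split()
)

sample = pd.DataFrame({
    'country': ['us', np.nan, 'gb', np.nan],
    'state': ['tx', 'tx', np.nan, np.nan],
    'city': ['san marcos', 'lackland afb', 'chester (uk/england)', 'unknown place'],
})

# Fill the country only where it is missing and the state is a known US code.
missing_us = sample['country'].isna() & sample['state'].isin(US_STATES)
sample.loc[missing_us, 'country'] = 'us'

# Rows that still have no country are the ones left to drop.
cleaned = sample.dropna(subset=['country'])
print(cleaned)
```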


In [577]:
ufouse = ufosight[['country', 'latitude', 'longitude', 'year', 'm/d', 'time', 'seconds']]


In [578]:
ufouse.dropna().head()

Unnamed: 0,country,latitude,longitude,year,m/d,time,seconds
0,us,29.8830556,-97.941111,1949,10/10,20:30,2700
2,gb,53.2,-2.916667,1955,10/10,17:00,20
3,us,28.9783333,-96.645833,1956,10/10,21:00,20
4,us,21.4180556,-157.803611,1960,10/10,20:00,900
5,us,36.595,-82.188889,1961,10/10,19:00,300


In [579]:
ufouse.shape

(80332, 7)

In [580]:
ufouse.groupby('country').count()

Unnamed: 0_level_0,latitude,longitude,year,m/d,time,seconds
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
au,538,538,538,538,538,538
ca,3000,3000,3000,3000,3000,3000
de,105,105,105,105,105,105
gb,1905,1905,1905,1905,1905,1905
us,65114,65114,65114,65114,65114,65114
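
`groupby('country').count()` repeats the same number in every column and silently leaves out the rows without a country. As a sketch on toy data, `value_counts(dropna=False)` gives one count per country and also shows how many rows have no country at all:

```python
import numpy as np
import pandas as pd

sample = pd.Series(['us', 'us', np.nan, 'gb', 'us'], name='country')

# One count per distinct value, with missing values kept as their own entry.
counts = sample.value_counts(dropna=False)
n_missing = int(sample.isna().sum())
print(counts)
```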


In [592]:
ufouse.groupby('year').mean(numeric_only=True).head()

Unnamed: 0_level_0,longitude
year,Unnamed: 1_level_1
1906,16.373819
1910,-94.295556
1916,2.213749
1920,-86.013333
1925,-90.015
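
Only `longitude` appears in the means above, which suggests that `latitude` and `seconds` were read as text. A possible fix, sketched on toy data, is to convert the columns with `pd.to_numeric` before grouping; `'bad-value'` is a made-up stand-in for whatever non-numeric entry the real file may contain. Once `year` is numeric, the number of sightings per year is a plain `size()`.

```python
import pandas as pd

sample = pd.DataFrame({
    'year': ['1949', '1949', '1955', '1956'],
    'latitude': ['29.8830556', '29.38421', '53.2', 'bad-value'],
    'seconds': ['2700', '7200', '20', '20'],
})

# Convert text columns to numbers; entries that cannot be parsed become NaN.
for col in ['year', 'latitude', 'seconds']:
    sample[col] = pd.to_numeric(sample[col], errors='coerce')

per_year = sample.groupby('year').size()                  # sightings per year
mean_seconds = sample.groupby('year')['seconds'].mean()   # mean duration per year
print(per_year)
print(mean_seconds)
```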


____________________________





Correlations inside ufosight, after cleaning: 

In [588]:
# ufouse.groupby

AttributeError: 'function' object has no attribute 'head'
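
One possible reading of "correlations inside ufosight" is a pairwise correlation between the numeric columns. The sketch below assumes that reading and uses the five rows shown earlier, already as numbers; on the real frame the columns would first need converting from text.

```python
import pandas as pd

sample = pd.DataFrame({
    'latitude': [29.8830556, 29.38421, 53.2, 28.9783333, 21.4180556],
    'longitude': [-97.941111, -98.581082, -2.916667, -96.645833, -157.803611],
    'seconds': [2700, 7200, 20, 20, 900],
})

# Pearson correlation between every pair of the selected numeric columns.
corr = sample[['latitude', 'longitude', 'seconds']].corr()
print(corr)
```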

## Countries

In [582]:
print(list(countries))

['CountryCode', 'ShortName', 'TableName', 'LongName', 'Alpha2Code', 'CurrencyUnit', 'SpecialNotes', 'Region', 'IncomeGroup', 'Wb2Code', 'NationalAccountsBaseYear', 'NationalAccountsReferenceYear', 'SnaPriceValuation', 'LendingCategory', 'OtherGroups', 'SystemOfNationalAccounts', 'AlternativeConversionFactor', 'PppSurveyYear', 'BalanceOfPaymentsManualInUse', 'ExternalDebtReportingStatus', 'SystemOfTrade', 'GovernmentAccountingConcept', 'ImfDataDisseminationStandard', 'LatestPopulationCensus', 'LatestHouseholdSurvey', 'SourceOfMostRecentIncomeAndExpenditureData', 'VitalRegistrationComplete', 'LatestAgriculturalCensus', 'LatestIndustrialData', 'LatestTradeData', 'LatestWaterWithdrawalData']


In [583]:
print(list(countries['ShortName']))

['Afghanistan', 'Albania', 'Algeria', 'American Samoa', 'Andorra', 'Angola', 'Antigua and Barbuda', 'Arab World', 'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon', 'Canada', 'Caribbean small states', 'Cayman Islands', 'Central African Republic', 'Central Europe and the Baltics', 'Chad', 'Channel Islands', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo', 'Costa Rica', "Côte d'Ivoire", 'Croatia', 'Cuba', 'Curaçao', 'Cyprus', 'Czech Republic', "Dem. People's Rep. Korea", 'Dem. Rep. Congo', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'East Asia & Pacific (all income levels)', 'East Asia & Pacific (developing only)', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Ethiopia', 'Euro area

In [584]:
countries.sort_values('ShortName').head()

Unnamed: 0,CountryCode,ShortName,TableName,LongName,Alpha2Code,CurrencyUnit,SpecialNotes,Region,IncomeGroup,Wb2Code,...,GovernmentAccountingConcept,ImfDataDisseminationStandard,LatestPopulationCensus,LatestHouseholdSurvey,SourceOfMostRecentIncomeAndExpenditureData,VitalRegistrationComplete,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData
0,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan,AF,Afghan afghani,Fiscal year end: March 20; reporting period fo...,South Asia,Low income,AF,...,Consolidated central government,General Data Dissemination System (GDDS),1979,"Multiple Indicator Cluster Survey (MICS), 2010/11","Integrated household survey (IHS), 2008",,2013/14,,2013.0,2000.0
1,ALB,Albania,Albania,Republic of Albania,AL,Albanian lek,,Europe & Central Asia,Upper middle income,AL,...,Budgetary central government,General Data Dissemination System (GDDS),2011,"Demographic and Health Survey (DHS), 2008/09",Living Standards Measurement Study Survey (LSM...,Yes,2012,2011.0,2013.0,2006.0
2,DZA,Algeria,Algeria,People's Democratic Republic of Algeria,DZ,Algerian dinar,,Middle East & North Africa,Upper middle income,DZ,...,Budgetary central government,General Data Dissemination System (GDDS),2008,"Multiple Indicator Cluster Survey (MICS), 2012","Integrated household survey (IHS), 1995",,,2010.0,2013.0,2001.0
3,ASM,American Samoa,American Samoa,American Samoa,AS,U.S. dollar,,East Asia & Pacific,Upper middle income,AS,...,,,2010,,,Yes,2007,,,
4,ADO,Andorra,Andorra,Principality of Andorra,AD,Euro,,Europe & Central Asia,High income: nonOECD,AD,...,,,2011. Population data compiled from administra...,,,Yes,,,2006.0,


We are interested in 5 countries:

In [585]:
selected = ['United States', 'Germany', 'United Kingdom', 'Australia', 'Canada']
countrySele = countries[countries.ShortName.isin(selected)]
countrySele


Unnamed: 0,CountryCode,ShortName,TableName,LongName,Alpha2Code,CurrencyUnit,SpecialNotes,Region,IncomeGroup,Wb2Code,...,GovernmentAccountingConcept,ImfDataDisseminationStandard,LatestPopulationCensus,LatestHouseholdSurvey,SourceOfMostRecentIncomeAndExpenditureData,VitalRegistrationComplete,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData
11,AUS,Australia,Australia,Commonwealth of Australia,AU,Australian dollar,Fiscal year end: June 30; reporting period for...,East Asia & Pacific,High income: OECD,AU,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Expenditure survey/budget survey (ES/BS), 2003",Yes,2011,2011.0,2013.0,2000.0
34,CAN,Canada,Canada,Canada,CA,Canadian dollar,Fiscal year end: March 31; reporting period fo...,North America,High income: OECD,CA,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Labor force survey (LFS), 2010",Yes,2011,2011.0,2013.0,1986.0
80,DEU,Germany,Germany,Federal Republic of Germany,DE,Euro,A simple multiplier is used to convert the nat...,Europe & Central Asia,High income: OECD,DE,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Integrated household survey (IHS), 2010",Yes,2010,2010.0,2013.0,2007.0
233,GBR,United Kingdom,United Kingdom,United Kingdom of Great Britain and Northern I...,GB,Pound sterling,,Europe & Central Asia,High income: OECD,GB,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Income survey (IS), 2010",Yes,2010,2010.0,2013.0,2007.0
234,USA,United States,United States,United States of America,US,U.S. dollar,Fiscal year end: September 30; reporting perio...,North America,High income: OECD,US,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2010,,"Labor force survey (LFS), 2010",Yes,2012,2008.0,2013.0,2005.0
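
The `Alpha2Code` values in this table (AU, CA, DE, GB, US) are the uppercase form of the country codes used in the UFO data, so the two tables could be joined on that column. A minimal sketch with toy frames (`sightings` and `countries_sample` are illustrative stand-ins for `ufosight` and `countries`):

```python
import numpy as np
import pandas as pd

sightings = pd.DataFrame({'country': ['us', 'gb', 'de', np.nan]})
countries_sample = pd.DataFrame({
    'Alpha2Code': ['AU', 'CA', 'DE', 'GB', 'US'],
    'ShortName': ['Australia', 'Canada', 'Germany', 'United Kingdom', 'United States'],
})

# Build a join key in the same case as Alpha2Code; missing countries stay missing.
sightings['Alpha2Code'] = sightings['country'].str.upper()
named = sightings.merge(countries_sample, how='left', on='Alpha2Code')
print(named)
```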


There's no population data, which we need in order to judge whether the number of sightings is meaningful. We take it from Google and add it to countries:

Since the data turned out to cover only a few countries, all with similarly high income, no economic index will be analyzed. An analysis by state could be done later if income data for states is found.

In [586]:
countrySele = countrySele[['ShortName', 'LatestPopulationCensus', 'LatestHouseholdSurvey', 'SourceOfMostRecentIncomeAndExpenditureData',
                          'LatestAgriculturalCensus', 'LatestIndustrialData', 'LatestTradeData', 'LatestWaterWithdrawalData']]
countrySele

Unnamed: 0,ShortName,LatestPopulationCensus,LatestHouseholdSurvey,SourceOfMostRecentIncomeAndExpenditureData,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData
11,Australia,2011,,"Expenditure survey/budget survey (ES/BS), 2003",2011,2011.0,2013.0,2000.0
34,Canada,2011,,"Labor force survey (LFS), 2010",2011,2011.0,2013.0,1986.0
80,Germany,2011,,"Integrated household survey (IHS), 2010",2010,2010.0,2013.0,2007.0
233,United Kingdom,2011,,"Income survey (IS), 2010",2010,2010.0,2013.0,2007.0
234,United States,2010,,"Labor force survey (LFS), 2010",2012,2008.0,2013.0,2005.0


In [595]:
countrySele.index.values

array([ 11,  34,  80, 233, 234])

In [597]:
populat = pd.DataFrame({'Australia': 24.6,  
                      'Canada': 36.71, 
                      'Germany': 82.79,
                      'United Kingdom': 66.02, 
                      'United States': 325.7,}, index = [1])
populat.T.reset_index()

Unnamed: 0,index,1
0,Australia,24.6
1,Canada,36.71
2,Germany,82.79
3,United Kingdom,66.02
4,United States,325.7


In [601]:
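# Note: this rename has no effect, because the columns here are the country names and none is called 'Index'.
populat = populat.rename(columns={'Index': 'ShortName'}) 
# Note: each run of this cell transposes populat again, which appears to explain the garbled output below.
populat = populat.T.reset_index()
# populat.columns=['ShortName','pop']
populat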

Unnamed: 0,index,0,1,2,3,4,5
0,index,ShortName,0,1,2,3,4
1,0,ShortName,Australia,Canada,Germany,United Kingdom,United States
2,1,1,24.6,36.71,82.79,66.02,325.7


In [602]:
pd.merge(countrySele, populat, how= 'left', on='ShortName')

KeyError: 'ShortName'
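
The `KeyError` arises because `populat` never gets a column named `ShortName`. One way around it, sketched below, is to build the populations as a named Series and turn its index into that column. The figures are the ones entered above, assumed to be in millions; `country_sample` is a stand-in for `countrySele`, carrying the sighting counts from the earlier `groupby('country').count()` output so that a per-capita rate can be computed. The column names `pop`, `sightings` and `per_million` are choices made for this sketch.

```python
import pandas as pd

# Populations (assumed to be in millions), keyed by the same names as ShortName.
populat = pd.Series(
    {'Australia': 24.6, 'Canada': 36.71, 'Germany': 82.79,
     'United Kingdom': 66.02, 'United States': 325.7},
    name='pop',
)
# Name the index, then move it into a regular column: columns are ['ShortName', 'pop'].
populat = populat.rename_axis('ShortName').reset_index()

# Stand-in for countrySele, with the sighting counts reported earlier.
country_sample = pd.DataFrame({
    'ShortName': ['Australia', 'Canada', 'Germany', 'United Kingdom', 'United States'],
    'sightings': [538, 3000, 105, 1905, 65114],
})

merged = pd.merge(country_sample, populat, how='left', on='ShortName')
merged['per_million'] = merged['sightings'] / merged['pop']
print(merged.sort_values('per_million', ascending=False))
```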