# Week 4 Notes - Getting and Cleaning Data
Final week of the course! Let's get it! 

## Editing text variables
Imagine we have imported a data frame and, on perusing it, we realise that the column names are formatted inconsistently: mixed case, numbers and other characters in them, perhaps spaces too. That makes them more difficult to work with than they need to be. What can we do?

In [2]:
using DataFrames, CSV, DataFramesMeta, HTTP, Dates

### Mixed case formatting
In R we can use the **tolower()** function alongside the **names()** function to transform the column names to lower case; the matching **toupper()** function converts to upper case.
```R
tolower(names(data))
```

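In Julia we can lowercase the column names of a data frame by passing **lowercase** to the **rename()** function from the DataFrames package. Note that **rename()** returns a new data frame, while **rename!()** changes the existing one in place.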
```julia
rename(lowercase, df)
```
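
As a minimal sketch of the difference between the two calls, assuming a small made-up data frame (`df` and its columns are hypothetical, not part of the course data):

```julia
using DataFrames

# Hypothetical data frame with inconsistently cased column names
df = DataFrame("MusicData" => [1, 2], "tommorrowTemp" => [3.5, 4.0])

# rename returns a copy with the function applied to every column name
lowered = rename(lowercase, df)
names(lowered)   # lowercased names
names(df)        # still the original names at this point

# rename! applies the same change to df itself
rename!(lowercase, df)
names(df)        # now lowercased as well
```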

And then some playing around with the functions directly

In [6]:
sample_names = ["MusicData", "tommorrowTemp", "AttRibutes"]

3-element Vector{String}:
 "MusicData"
 "tommorrowTemp"
 "AttRibutes"

In [7]:
lowercase.(sample_names)

3-element Vector{String}:
 "musicdata"
 "tommorrowtemp"
 "attributes"

### Stripping, splitting characters 

Say we have variable names that contain a dot followed by a number, and we are only interested in the part before the dot. We can split the names on that character. In R;
```R
splitNames = strsplit(names(cameraData), "\\.")
```

In [17]:
sample_split = ["MusicData.3", "tommorrowTemp.3", "AttRibutes"]

3-element Vector{String}:
 "MusicData.3"
 "tommorrowTemp.3"
 "AttRibutes"

Replace all digits and dots with nothing, effectively deleting them (passing several pattern pairs to **replace** in one call needs Julia 1.7 or later).

In [45]:
replace.(sample_split, r"\d" => "", "." => "")

3-element Vector{String}:
 "MusicData"
 "tommorrowTemp"
 "AttRibutes"

This would be done with something like **gsub** in R, which here removes every underscore;
```R
gsub("_", "", testName)
```

Split the variable names on the dot character

In [46]:
split_names = split.(sample_split, ".")

3-element Vector{Vector{SubString{String}}}:
 ["MusicData", "3"]
 ["tommorrowTemp", "3"]
 ["AttRibutes"]

In [50]:
split_names[1][1]

"MusicData"

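Picking out the first piece for every name at once can be done by broadcasting `first` over the split results. This is a small sketch using the same sample names as above, roughly what applying a first-element function over the list would do in R:

```julia
# Sample column names, some with a ".<number>" suffix
sample_split = ["MusicData.3", "tommorrowTemp.3", "AttRibutes"]

# split each name on the dot, then keep the first piece of each result
base_names = first.(split.(sample_split, "."))

# split returns SubStrings; convert back to plain Strings if needed
base_names = String.(base_names)
```

Names without a dot come through unchanged, because splitting them yields a single piece.
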
### Finding values - grep and so on
R's **grep()** and **grepl()** are named after the Unix grep utility, whereas Julia uses its own **findall**, **findfirst** and **occursin** functions for this purpose, along with **in**, **all** etc.

Let's see R's;
```R
grep("Alameda", data)
table(grepl("Alameda", data))
```

In Julia we could use **findall()** with a predicate, which returns the indices of the matching elements;
```julia
findall(x -> occursin("something", x), data)
```
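
To make the mapping to the R calls concrete, here is a rough sketch on a made-up vector (`data` and its values are hypothetical stand-ins for a column of street names):

```julia
# Hypothetical vector of intersection names
data = ["Alameda Ave", "Charles St", "The Alameda & 33rd", "Monroe St"]

# grepl equivalent: one Bool per element
hits = occursin.("Alameda", data)

# grep equivalent: indices of the matching elements
idx = findall(hits)

# grep(value = TRUE) equivalent: the matching elements themselves
matches = data[hits]

# table(grepl(...)) equivalent: count matches and non-matches
n_true = count(hits)
n_false = length(data) - n_true
```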

## Dates and Time
The best guide I have found on this in Julia is at https://en.wikibooks.org/wiki/Introducing_Julia/Working_with_dates_and_times, which is very, very helpful.
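
For a quick taste, here is a minimal sketch using the `Dates` standard library; the date itself is an arbitrary example, not taken from the course data:

```julia
using Dates

# parse a date from text with an explicit format
d = Date("2012-01-02", dateformat"yyyy-mm-dd")

year(d)        # the year as an integer
dayname(d)     # the weekday name as a string

# format a date back into text (u is the abbreviated month name)
Dates.format(d, "d u yyyy")

# date arithmetic with period types
d + Day(30)
```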

## Quiz

## 1. 
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here: 

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv

and load the data into R. The code book, describing the variable names is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf 

Apply strsplit() to split all the names of the data frame on the characters "wgtp". What is the value of the 123 element of the resulting list? 

In [9]:
q1_csv = CSV.read(HTTP.get("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv").body, DataFrame)

Row,RT,SERIALNO,DIVISION,PUMA,REGION,ST,ADJUST,WGTP,NP,TYPE,ACR,AGS,BDS,BLD,BUS,CONP,ELEP,FS,FULP,GASP,HFL,INSP,KIT,MHP,MRGI,MRGP,MRGT,MRGX,PLM,RMS,RNTM,RNTP,SMP,TEL,TEN,VACS,VAL,VEH,WATP,YBL,FES,FINCP,FPARC,GRNTP,GRPIP,HHL,HHT,HINCP,HUGCL,HUPAC,HUPAOC,HUPARC,LNGI,MV,NOC,NPF,NPP,NR,NRC,OCPIP,PARTNER,PSF,R18,R60,R65,RESMODE,SMOCP,SMX,SRNT,SVAL,TAXP,WIF,WKEXREL,WORKSTAT,FACRP,FAGSP,FBDSP,FBLDP,FBUSP,FCONP,FELEP,FFSP,FFULP,FGASP,FHFLP,FINSP,FKITP,FMHP,FMRGIP,FMRGP,FMRGTP,FMRGXP,FMVYP,FPLMP,FRMSP,FRNTMP,FRNTP,FSMP,FSMXHP,FSMXSP,⋯
Unnamed: 0_level_1,String1,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,Int64?,⋯
1,H,186,8,700,4,16,1015675,89,4,1,1,missing,4,2,2,missing,180,0,2,3,3,600,1,missing,1,1300,1,1,1,9,missing,missing,missing,1,1,missing,17,3,840,5,2,105600,2,missing,missing,1,1,105600,0,2,2,2,1,4,2,4,0,0,2,18,0,0,1,0,0,1,1550,3,0,1,24,3,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
2,H,306,8,700,4,16,1015675,310,1,1,missing,missing,1,7,missing,missing,60,0,2,3,3,missing,1,missing,missing,missing,missing,missing,1,2,2,600,missing,1,3,missing,missing,1,1,3,missing,missing,missing,660,23,1,4,34000,0,4,4,4,1,3,0,missing,0,0,0,missing,0,0,0,0,0,2,missing,missing,1,0,missing,missing,missing,missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
3,H,395,8,100,4,16,1015675,106,2,1,1,missing,3,2,2,missing,70,0,2,30,1,200,1,missing,missing,missing,missing,3,1,7,missing,missing,missing,1,2,missing,18,2,50,5,7,9400,2,missing,missing,1,3,9400,0,2,2,2,1,2,1,2,0,0,1,23,0,0,1,0,0,1,179,missing,0,1,16,1,13,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
4,H,506,8,700,4,16,1015675,240,4,1,1,missing,4,2,2,missing,40,0,2,80,1,200,1,missing,1,860,1,1,1,6,missing,missing,400,1,1,missing,19,3,500,2,1,66000,1,missing,missing,1,1,66000,0,1,1,1,1,3,2,4,0,0,2,26,0,0,1,0,0,2,1422,1,0,1,31,2,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
5,H,835,8,800,4,16,1015675,118,4,1,2,1,5,2,2,missing,250,0,2,3,3,700,1,missing,1,1900,1,1,1,7,missing,missing,650,1,1,missing,20,5,2,3,1,93000,2,missing,missing,1,1,93000,0,2,2,2,1,1,1,4,0,0,1,36,0,0,1,0,0,1,2800,1,0,1,25,3,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
6,H,989,8,700,4,16,1015675,115,4,1,1,missing,3,2,2,missing,130,0,2,3,3,250,1,missing,1,700,1,1,1,6,missing,missing,400,1,1,missing,15,2,1200,5,2,61000,1,missing,missing,1,1,61000,0,1,1,1,1,4,2,4,0,0,2,26,0,0,1,0,0,2,1330,2,0,1,7,1,7,3,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
7,H,1861,8,700,4,16,1015675,0,1,2,missing,missing,missing,missing,missing,missing,missing,0,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,5,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,0,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,⋯
8,H,2120,8,200,4,16,1015675,35,1,1,1,missing,2,1,2,missing,40,0,480,3,4,missing,1,missing,missing,missing,missing,missing,1,4,missing,missing,missing,1,4,missing,missing,1,650,5,missing,missing,missing,missing,missing,1,6,10400,0,4,4,4,1,5,0,missing,0,0,0,missing,0,0,0,1,1,2,missing,missing,1,0,missing,missing,missing,missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
9,H,2278,8,400,4,16,1015675,47,2,1,1,missing,3,2,2,missing,2,0,2,3,3,770,1,missing,1,750,1,1,1,6,missing,missing,missing,1,1,missing,13,2,660,3,2,209000,4,missing,missing,1,1,209000,0,4,4,4,1,1,0,2,0,0,0,5,0,0,0,1,1,1,805,3,0,1,22,1,6,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,⋯
10,H,2428,8,500,4,16,1015675,51,2,1,1,missing,2,1,2,missing,20,0,2,140,1,120,1,220,missing,missing,missing,3,1,5,missing,missing,missing,1,2,missing,1,2,2,5,missing,missing,missing,missing,missing,2,5,35400,0,4,4,4,2,1,0,missing,0,1,0,7,0,0,0,0,0,1,196,missing,0,0,4,missing,missing,missing,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,⋯


In [7]:
colnames = names(q1_csv)

188-element Vector{String}:
 "RT"
 "SERIALNO"
 "DIVISION"
 "PUMA"
 "REGION"
 "ST"
 "ADJUST"
 "WGTP"
 "NP"
 "TYPE"
 "ACR"
 "AGS"
 "BDS"
 ⋮
 "wgtp69"
 "wgtp70"
 "wgtp71"
 "wgtp72"
 "wgtp73"
 "wgtp74"
 "wgtp75"
 "wgtp76"
 "wgtp77"
 "wgtp78"
 "wgtp79"
 "wgtp80"

In [15]:
split_colnames = split.(colnames, "wgtp")

188-element Vector{Vector{SubString{String}}}:
 ["RT"]
 ["SERIALNO"]
 ["DIVISION"]
 ["PUMA"]
 ["REGION"]
 ["ST"]
 ["ADJUST"]
 ["WGTP"]
 ["NP"]
 ["TYPE"]
 ["ACR"]
 ["AGS"]
 ["BDS"]
 ⋮
 ["", "69"]
 ["", "70"]
 ["", "71"]
 ["", "72"]
 ["", "73"]
 ["", "74"]
 ["", "75"]
 ["", "76"]
 ["", "77"]
 ["", "78"]
 ["", "79"]
 ["", "80"]

In [17]:
split_colnames[123]

2-element Vector{SubString{String}}:
 ""
 "15"

The answer is "" "15"

## 2. 
Load the Gross Domestic Product data for the 190 ranked countries in this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv

Remove the commas from the GDP numbers in millions of dollars and average them. What is the average?

Original data sources:

http://data.worldbank.org/data-catalog/GDP-ranking-table 

In [22]:
gdp = download("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv", "gdp.csv") 

"gdp.csv"

In [23]:
run(`head gdp.csv`)

,Gross domestic product 2012,,,,,,,,
,,,,,,,,,
,,,,(millions of,,,,,
,Ranking,,Economy,US dollars),,,,,
,,,,,,,,,
USA,1,,United States," 16,244,600 ",,,,,
CHN,2,,China," 8,227,103 ",,,,,
JPN,3,,Japan," 5,959,718 ",,,,,
DEU,4,,Germany," 3,428,131 ",,,,,
FRA,5,,France," 2,612,878 ",,,,,


Process(`head gdp.csv`, ProcessExited(0))

In [31]:
gdp_raw = CSV.read("gdp.csv", DataFrame, skipto=6, header=4)
# keep the code, rank, country and GDP columns, then drop rows with missing values
gdp_raw_nomissing = dropmissing(gdp_raw[:, [1, 2, 4, 5]])
gdp_cleaned = rename!(gdp_raw_nomissing, ["CountryCode", "Rank", "Country", "US_dollars"])

| Row | CountryCode | Rank | Country | US_dollars |
|---|---|---|---|---|
| | String3 | String | String31 | String15 |
| 1 | USA | 1 | United States | 16244600 |
| 2 | CHN | 2 | China | 8227103 |
| 3 | JPN | 3 | Japan | 5959718 |
| 4 | DEU | 4 | Germany | 3428131 |
| 5 | FRA | 5 | France | 2612878 |
| 6 | GBR | 6 | United Kingdom | 2471784 |
| 7 | BRA | 7 | Brazil | 2252664 |
| 8 | RUS | 8 | Russian Federation | 2014775 |
| 9 | ITA | 9 | Italy | 2014670 |
| 10 | IND | 10 | India | 1841710 |


In [32]:
gdp_cleaned.US_dollars .= replace.(gdp_cleaned.US_dollars, "," => "")

190-element Vector{String}:
 " 16244600 "
 " 8227103 "
 " 5959718 "
 " 3428131 "
 " 2612878 "
 " 2471784 "
 " 2252664 "
 " 2014775 "
 " 2014670 "
 " 1841710 "
 " 1821424 "
 " 1532408 "
 " 1322965 "
 ⋮
 " 767 "
 " 713 "
 " 684 "
 " 596 "
 " 480 "
 " 472 "
 " 326 "
 " 263 "
 " 228 "
 " 182 "
 " 175 "
 " 40 "

In [33]:
gdp_cleaned.US_dollars = parse.(Int, gdp_cleaned.US_dollars)

190-element Vector{Int64}:
 16244600
  8227103
  5959718
  3428131
  2612878
  2471784
  2252664
  2014775
  2014670
  1841710
  1821424
  1532408
  1322965
        ⋮
      767
      713
      684
      596
      480
      472
      326
      263
      228
      182
      175
       40

In [34]:
using StatsBase

In [35]:
mean(gdp_cleaned.US_dollars)

377652.4210526316

The answer is 377652.4210526316 
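
The same clean-up can be written as one pipeline. This is a sketch on a few made-up values (`raw` is hypothetical), using only Base and the `Statistics` standard library, and trimming the padding explicitly rather than relying on `parse` to tolerate it:

```julia
using Statistics

# Hypothetical raw values, padded and comma-separated like the GDP column
raw = [" 16,244,600 ", " 8,227,103 ", " 40 "]

# drop the commas, trim the padding, then parse to integers
amounts = parse.(Int, strip.(replace.(raw, "," => "")))

# average of the cleaned values
mean(amounts)
```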

## 3. 
In the data set from Question 2 what is a regular expression that would allow you to count the number of countries whose name begins with "United"? Assume that the variable with the country names in it is named countryNames. How many countries begin with United? 

In [61]:
contains.(gdp_cleaned.Country, "United")

190-element BitVector:
 1
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

In [65]:
filter(x -> contains(x.Country, "United"), gdp_cleaned)

| Row | CountryCode | Rank | Country | US_dollars |
|---|---|---|---|---|
| | String3 | String | String31 | Int64 |
| 1 | USA | 1 | United States | 16244600 |
| 2 | GBR | 6 | United Kingdom | 2471784 |
| 3 | ARE | 32 | United Arab Emirates | 348595 |


In [42]:
occursin("United", "United States")

true

In [45]:
united_countries = occursin.("United", gdp_cleaned.Country)

190-element BitVector:
 1
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

In [46]:
countmap(united_countries)

Dict{Bool, Int64} with 2 entries:
  0 => 187
  1 => 3
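
The checks above find "United" anywhere in the name, while the question asks for names that begin with it. An anchored regular expression expresses that; in R the pattern would be used as `grep("^United", countryNames)`. The sketch below uses a made-up vector (the last entry is hypothetical, added only to show the difference):

```julia
# Hypothetical country names; the last one contains "United" but does not begin with it
countryNames = ["United States", "United Kingdom", "United Arab Emirates",
                "China", "Not United Land"]

# ^ anchors the pattern to the start of the string
begins_united = occursin.(r"^United", countryNames)
n_begins = count(begins_united)

# an unanchored search also counts the last entry
n_contains = count(occursin.("United", countryNames))

# startswith gives the anchored result without a regex
n_startswith = count(startswith.(countryNames, "United"))
```

On the GDP data the anchored and unanchored versions agree, since all three matching countries listed above begin with "United", so the answer is 3.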

## 4. 
Load the Gross Domestic Product data for the 190 ranked countries in this data set:

 https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv
 

Load the educational data from this data set:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv

Match the data based on the country shortcode. Of the countries for which the end of the fiscal year is available, how many end in June?

The GDP data is the same as above, so we simply need to download the education data

In [66]:
download("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv", "edu.csv")

"edu.csv"

In [68]:
edu_df = CSV.read("edu.csv", DataFrame)

Row,CountryCode,Long Name,Income Group,Region,Lending category,Other groups,Currency Unit,Latest population census,Latest household survey,Special Notes,National accounts base year,National accounts reference year,System of National Accounts,SNA price valuation,Alternative conversion factor,PPP survey year,Balance of Payments Manual in use,External debt Reporting status,System of trade,Government Accounting concept,IMF data dissemination standard,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,2-alpha code,WB-2 code,Table Name,Short Name
Unnamed: 0_level_1,String3,String,String31?,String31?,String7?,String15?,String?,String15?,String31?,String?,String?,Int64?,Int64?,String3?,String31?,Int64?,String7?,String15?,String7?,String15?,String7?,String15?,String3?,String31?,Int64?,Int64?,Int64?,String3?,String3?,String,String
1,ABW,Aruba,High income: nonOECD,Latin America & Caribbean,missing,missing,Aruban florin,2000,missing,missing,1995,missing,missing,missing,missing,missing,missing,missing,Special,missing,missing,missing,missing,missing,missing,2008,missing,AW,AW,Aruba,Aruba
2,ADO,Principality of Andorra,High income: nonOECD,Europe & Central Asia,missing,missing,Euro,Register based,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,General,missing,missing,missing,Yes,missing,missing,2006,missing,AD,AD,Andorra,Andorra
3,AFG,Islamic State of Afghanistan,Low income,South Asia,IDA,HIPC,Afghan afghani,1979,"MICS, 2003",Fiscal year end: March 20; reporting period for national accounts data: FY.,2002/2003,missing,missing,VAB,missing,missing,missing,Actual,General,Consolidated,GDDS,missing,missing,missing,missing,2008,2000,AF,AF,Afghanistan,Afghanistan
4,AGO,People's Republic of Angola,Lower middle income,Sub-Saharan Africa,IDA,missing,Angolan kwanza,1970,"MICS, 2001, MIS, 2006/07",missing,1997,missing,missing,VAP,1991-96,2005,BPM5,Actual,Special,missing,GDDS,"IHS, 2000",missing,1964-65,missing,1991,2000,AO,AO,Angola,Angola
5,ALB,Republic of Albania,Upper middle income,Europe & Central Asia,IBRD,missing,Albanian lek,2001,"MICS, 2005",missing,missing,1996,1993,VAB,missing,2005,BPM5,Actual,General,Consolidated,GDDS,"LSMS, 2005",Yes,1998,2005,2008,2000,AL,AL,Albania,Albania
6,ARE,United Arab Emirates,High income: nonOECD,Middle East & North Africa,missing,missing,U.A.E. dirham,2005,missing,missing,1995,missing,missing,VAB,missing,missing,BPM4,missing,General,Consolidated,GDDS,missing,missing,1998,missing,2008,2005,AE,AE,United Arab Emirates,United Arab Emirates
7,ARG,Argentine Republic,Upper middle income,Latin America & Caribbean,IBRD,missing,Argentine peso,2001,missing,missing,1993,missing,1993,VAB,1971-84,2005,BPM5,Actual,Special,Consolidated,SDDS,"IHS, 2006",Yes,2002,2001,2008,2000,AR,AR,Argentina,Argentina
8,ARM,Republic of Armenia,Lower middle income,Europe & Central Asia,Blend,missing,Armenian dram,2001,"DHS, 2005",missing,missing,1996,1993,VAB,1990-95,2005,BPM5,Actual,Special,Consolidated,SDDS,"IHS, 2007",Yes,missing,missing,2008,2000,AM,AM,Armenia,Armenia
9,ASM,American Samoa,Upper middle income,East Asia & Pacific,missing,missing,U.S. dollar,2000,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,missing,Yes,missing,missing,missing,missing,AS,AS,American Samoa,American Samoa
10,ATG,Antigua and Barbuda,Upper middle income,Latin America & Caribbean,IBRD,missing,East Caribbean dollar,2001,missing,The government has revised national accounts data for 1998-2008.,1990,missing,missing,VAB,missing,missing,BPM5,missing,General,missing,GDDS,missing,Yes,missing,missing,2007,1990,AG,AG,Antigua and Barbuda,Antigua and Barbuda


In [70]:
edu_gdp = innerjoin(gdp_cleaned, edu_df, on="CountryCode")

Row,CountryCode,Rank,Country,US_dollars,Long Name,Income Group,Region,Lending category,Other groups,Currency Unit,Latest population census,Latest household survey,Special Notes,National accounts base year,National accounts reference year,System of National Accounts,SNA price valuation,Alternative conversion factor,PPP survey year,Balance of Payments Manual in use,External debt Reporting status,System of trade,Government Accounting concept,IMF data dissemination standard,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,2-alpha code,WB-2 code,Table Name,Short Name
Unnamed: 0_level_1,String3,String,String31,Int64,String,String31?,String31?,String7?,String15?,String?,String15?,String31?,String?,String?,Int64?,Int64?,String3?,String31?,Int64?,String7?,String15?,String7?,String15?,String7?,String15?,String3?,String31?,Int64?,Int64?,Int64?,String3?,String3?,String,String
1,ABW,161,Aruba,2584,Aruba,High income: nonOECD,Latin America & Caribbean,missing,missing,Aruban florin,2000,missing,missing,1995,missing,missing,missing,missing,missing,missing,missing,Special,missing,missing,missing,missing,missing,missing,2008,missing,AW,AW,Aruba,Aruba
2,AFG,105,Afghanistan,20497,Islamic State of Afghanistan,Low income,South Asia,IDA,HIPC,Afghan afghani,1979,"MICS, 2003",Fiscal year end: March 20; reporting period for national accounts data: FY.,2002/2003,missing,missing,VAB,missing,missing,missing,Actual,General,Consolidated,GDDS,missing,missing,missing,missing,2008,2000,AF,AF,Afghanistan,Afghanistan
3,AGO,60,Angola,114147,People's Republic of Angola,Lower middle income,Sub-Saharan Africa,IDA,missing,Angolan kwanza,1970,"MICS, 2001, MIS, 2006/07",missing,1997,missing,missing,VAP,1991-96,2005,BPM5,Actual,Special,missing,GDDS,"IHS, 2000",missing,1964-65,missing,1991,2000,AO,AO,Angola,Angola
4,ALB,125,Albania,12648,Republic of Albania,Upper middle income,Europe & Central Asia,IBRD,missing,Albanian lek,2001,"MICS, 2005",missing,missing,1996,1993,VAB,missing,2005,BPM5,Actual,General,Consolidated,GDDS,"LSMS, 2005",Yes,1998,2005,2008,2000,AL,AL,Albania,Albania
5,ARE,32,United Arab Emirates,348595,United Arab Emirates,High income: nonOECD,Middle East & North Africa,missing,missing,U.A.E. dirham,2005,missing,missing,1995,missing,missing,VAB,missing,missing,BPM4,missing,General,Consolidated,GDDS,missing,missing,1998,missing,2008,2005,AE,AE,United Arab Emirates,United Arab Emirates
6,ARG,26,Argentina,475502,Argentine Republic,Upper middle income,Latin America & Caribbean,IBRD,missing,Argentine peso,2001,missing,missing,1993,missing,1993,VAB,1971-84,2005,BPM5,Actual,Special,Consolidated,SDDS,"IHS, 2006",Yes,2002,2001,2008,2000,AR,AR,Argentina,Argentina
7,ARM,133,Armenia,9951,Republic of Armenia,Lower middle income,Europe & Central Asia,Blend,missing,Armenian dram,2001,"DHS, 2005",missing,missing,1996,1993,VAB,1990-95,2005,BPM5,Actual,Special,Consolidated,SDDS,"IHS, 2007",Yes,missing,missing,2008,2000,AM,AM,Armenia,Armenia
8,ATG,172,Antigua and Barbuda,1134,Antigua and Barbuda,Upper middle income,Latin America & Caribbean,IBRD,missing,East Caribbean dollar,2001,missing,The government has revised national accounts data for 1998-2008.,1990,missing,missing,VAB,missing,missing,BPM5,missing,General,missing,GDDS,missing,Yes,missing,missing,2007,1990,AG,AG,Antigua and Barbuda,Antigua and Barbuda
9,AUS,12,Australia,1532408,Commonwealth of Australia,High income: OECD,East Asia & Pacific,missing,missing,Australian dollar,2006,missing,Fiscal year end: June 30; reporting period for national accounts data: FY.,missing,2007,1993,VAB,missing,2005,BPM5,missing,General,Consolidated,SDDS,"ES/BS, 1994",Yes,2001,2004,2008,2000,AU,AU,Australia,Australia
10,AUT,27,Austria,394708,Republic of Austria,High income: OECD,Europe & Central Asia,missing,Euro area,Euro,2001,missing,"A simple multiplier is used to convert the national currencies of EMU members to euros. The following irrevocable euro conversion rate was adopted by the EU Council on January 1, 1999: 1 euro = 13.7603 Austrian schilling. Please note that historical data before 1999 are not actual euros and are not comparable or suitable for aggregation across countries.",2000,missing,1993,VAB,missing,2005,BPM5,missing,Special,Consolidated,SDDS,IS 2000,Yes,1999-2000,2004,2008,2000,AT,AT,Austria,Austria


In [143]:
edu_gdp_special = filter(row -> !ismissing(row.var"Special Notes"), edu_gdp)

Row,CountryCode,Rank,Country,US_dollars,Long Name,Income Group,Region,Lending category,Other groups,Currency Unit,Latest population census,Latest household survey,Special Notes,National accounts base year,National accounts reference year,System of National Accounts,SNA price valuation,Alternative conversion factor,PPP survey year,Balance of Payments Manual in use,External debt Reporting status,System of trade,Government Accounting concept,IMF data dissemination standard,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,2-alpha code,WB-2 code,Table Name,Short Name
Unnamed: 0_level_1,String3,String,String31,Int64,String,String31?,String31?,String7?,String15?,String?,String15?,String31?,String?,String?,Int64?,Int64?,String3?,String31?,Int64?,String7?,String15?,String7?,String15?,String7?,String15?,String3?,String31?,Int64?,Int64?,Int64?,String3?,String3?,String,String
1,AFG,105,Afghanistan,20497,Islamic State of Afghanistan,Low income,South Asia,IDA,HIPC,Afghan afghani,1979,"MICS, 2003",Fiscal year end: March 20; reporting period for national accounts data: FY.,2002/2003,missing,missing,VAB,missing,missing,missing,Actual,General,Consolidated,GDDS,missing,missing,missing,missing,2008,2000,AF,AF,Afghanistan,Afghanistan
2,ATG,172,Antigua and Barbuda,1134,Antigua and Barbuda,Upper middle income,Latin America & Caribbean,IBRD,missing,East Caribbean dollar,2001,missing,The government has revised national accounts data for 1998-2008.,1990,missing,missing,VAB,missing,missing,BPM5,missing,General,missing,GDDS,missing,Yes,missing,missing,2007,1990,AG,AG,Antigua and Barbuda,Antigua and Barbuda
3,AUS,12,Australia,1532408,Commonwealth of Australia,High income: OECD,East Asia & Pacific,missing,missing,Australian dollar,2006,missing,Fiscal year end: June 30; reporting period for national accounts data: FY.,missing,2007,1993,VAB,missing,2005,BPM5,missing,General,Consolidated,SDDS,"ES/BS, 1994",Yes,2001,2004,2008,2000,AU,AU,Australia,Australia
4,AUT,27,Austria,394708,Republic of Austria,High income: OECD,Europe & Central Asia,missing,Euro area,Euro,2001,missing,"A simple multiplier is used to convert the national currencies of EMU members to euros. The following irrevocable euro conversion rate was adopted by the EU Council on January 1, 1999: 1 euro = 13.7603 Austrian schilling. Please note that historical data before 1999 are not actual euros and are not comparable or suitable for aggregation across countries.",2000,missing,1993,VAB,missing,2005,BPM5,missing,Special,Consolidated,SDDS,IS 2000,Yes,1999-2000,2004,2008,2000,AT,AT,Austria,Austria
5,BEL,25,Belgium,483262,Kingdom of Belgium,High income: OECD,Europe & Central Asia,missing,Euro area,Euro,2001,missing,"A simple multiplier is used to convert the national currencies of EMU members to euros. The following irrevocable euro conversion rate was adopted by the EU Council on January 1, 1999: 1 euro = 40.3399 Belgian franc. Please note that historical data before 1999 are not actual euros and are not comparable or suitable for aggregation across countries.",2000,missing,1993,VAB,missing,2005,BPM5,missing,Special,Consolidated,SDDS,"IHS, 2000",Yes,1999-2000 (conducted annually),2004,2008,missing,BE,BE,Belgium,Belgium
6,BGD,59,Bangladesh,116355,People's Republic of Bangladesh,Low income,South Asia,IDA,missing,Bangladeshi taka,2001,"DHS, 2007",Fiscal year end: June 30; reporting period for national accounts data: FY.,1995/1996,missing,1993,VAB,missing,2005,BPM5,Preliminary,General,Consolidated,GDDS,"IHS, 2005",missing,2005,1997,2007,2000,BD,BD,Bangladesh,Bangladesh
7,BHS,138,"Bahamas, The",8149,Commonwealth of The Bahamas,High income: nonOECD,Latin America & Caribbean,missing,missing,Bahamian dollar,2000,missing,The government has revised national accounts data for 1997-2007. The new base year is 2006.,2006,missing,1993,VAB,missing,missing,BPM5,missing,General,Budgetary,GDDS,missing,missing,missing,1997,2008,missing,BS,BS,"Bahamas, The",The Bahamas
8,BLZ,169,Belize,1493,Belize,Lower middle income,Latin America & Caribbean,IBRD,missing,Belize dollar,2000,"MICS, 2006",The government has revised national accounts data for 1991-2008.,2000,missing,1993,VAB,missing,missing,BPM5,Actual,General,Budgetary,GDDS,ES/BS 1995,missing,missing,missing,2008,2000,BZ,BZ,Belize,Belize
9,BMU,149,Bermuda,5474,The Bermudas,High income: nonOECD,North America,missing,missing,Bermuda dollar,2000,missing,The Statistical Office has revised national accounts data for 1996-2007.,1996,missing,missing,VAB,missing,missing,missing,missing,missing,missing,missing,missing,Yes,missing,missing,2008,missing,BM,BM,Bermuda,Bermuda
10,BWA,117,Botswana,14504,Republic of Botswana,Upper middle income,Sub-Saharan Africa,IBRD,missing,Botswana pula,2001,"MICS, 2000",Fiscal year end: June 30; reporting period for national accounts data: FY.,1993/1994,missing,1993,VAB,missing,2005,BPM5,Preliminary,General,Budgetary,GDDS,"ES/BS, 1993/94",missing,1993,2005,2008,2000,BW,BW,Botswana,Botswana


In [146]:
june_fiscal = filter(x -> contains(x.var"Special Notes", "Fiscal year end: June"), edu_gdp_special)

Row,CountryCode,Rank,Country,US_dollars,Long Name,Income Group,Region,Lending category,Other groups,Currency Unit,Latest population census,Latest household survey,Special Notes,National accounts base year,National accounts reference year,System of National Accounts,SNA price valuation,Alternative conversion factor,PPP survey year,Balance of Payments Manual in use,External debt Reporting status,System of trade,Government Accounting concept,IMF data dissemination standard,Source of most recent Income and expenditure data,Vital registration complete,Latest agricultural census,Latest industrial data,Latest trade data,Latest water withdrawal data,2-alpha code,WB-2 code,Table Name,Short Name
Unnamed: 0_level_1,String3,String,String31,Int64,String,String31?,String31?,String7?,String15?,String?,String15?,String31?,String?,String?,Int64?,Int64?,String3?,String31?,Int64?,String7?,String15?,String7?,String15?,String7?,String15?,String3?,String31?,Int64?,Int64?,Int64?,String3?,String3?,String,String
1,AUS,12,Australia,1532408,Commonwealth of Australia,High income: OECD,East Asia & Pacific,missing,missing,Australian dollar,2006,missing,Fiscal year end: June 30; reporting period for national accounts data: FY.,missing,2007,1993,VAB,missing,2005,BPM5,missing,General,Consolidated,SDDS,"ES/BS, 1994",Yes,2001,2004,2008,2000,AU,AU,Australia,Australia
2,BGD,59,Bangladesh,116355,People's Republic of Bangladesh,Low income,South Asia,IDA,missing,Bangladeshi taka,2001,"DHS, 2007",Fiscal year end: June 30; reporting period for national accounts data: FY.,1995/1996,missing,1993,VAB,missing,2005,BPM5,Preliminary,General,Consolidated,GDDS,"IHS, 2005",missing,2005,1997,2007,2000,BD,BD,Bangladesh,Bangladesh
3,BWA,117,Botswana,14504,Republic of Botswana,Upper middle income,Sub-Saharan Africa,IBRD,missing,Botswana pula,2001,"MICS, 2000",Fiscal year end: June 30; reporting period for national accounts data: FY.,1993/1994,missing,1993,VAB,missing,2005,BPM5,Preliminary,General,Budgetary,GDDS,"ES/BS, 1993/94",missing,1993,2005,2008,2000,BW,BW,Botswana,Botswana
4,EGY,38,"Egypt, Arab Rep.",262832,Arab Republic of Egypt,Lower middle income,Middle East & North Africa,IBRD,missing,Egyptian pound,2006,"DHS, 2008",Fiscal year end: June 30; reporting period for national accounts data: FY.,1991/1992,missing,missing,VAB,missing,2005,BPM5,Actual,Special,Budgetary,SDDS,"ES/BS, 2004-05",Yes,1999-2000,2001,2008,2000,EG,EG,"Egypt, Arab Rep.",Egypt
5,GMB,175,"Gambia, The",917,Republic of The Gambia,Low income,Sub-Saharan Africa,IDA,HIPC,Gambian dalasi,2003,"MICS, 2005/06",Fiscal year end: June 30; reporting period for national accounts data: CY.,1987,missing,missing,VAB,missing,2005,BPM5,Estimate,General,Consolidated,GDDS,"IHS, 2003",missing,2001-2002,missing,2008,2000,GM,GM,"Gambia, The",The Gambia
6,KEN,87,Kenya,40697,Republic of Kenya,Low income,Sub-Saharan Africa,IDA,missing,Kenyan shilling,1999,"DHS, 2003, SPA, 2004",Fiscal year end: June 30; reporting period for national accounts data: CY.,2001,missing,1993,VAB,missing,2005,BPM5,Actual,General,Budgetary,GDDS,"IHS, 2005-06",missing,1977-1979,2005,2008,2003,KE,KE,Kenya,Kenya
7,KWT,56,Kuwait,160913,State of Kuwait,High income: nonOECD,Middle East & North Africa,missing,missing,Kuwaiti dinar,2005,"FHS, 1996",Fiscal year end: June 30; reporting period for national accounts data: CY.,1995,missing,missing,VAP,missing,2005,BPM5,missing,Special,Consolidated,GDDS,missing,Yes,1970,missing,2007,2002,KW,KW,Kuwait,Kuwait
8,PAK,44,Pakistan,225143,Islamic Republic of Pakistan,Lower middle income,South Asia,Blend,missing,Pakistani rupee,1998,"DHS, 2006/07",Fiscal year end: June 30; reporting period for national accounts data: FY.,1999/2000,missing,1993,VAB,missing,2005,BPM5,Actual,General,Consolidated,GDDS,"LSMS, 2004/05",missing,2000,missing,2008,2000,PK,PK,Pakistan,Pakistan
9,PRI,61,Puerto Rico,101496,Puerto Rico,High income: nonOECD,Latin America & Caribbean,missing,missing,U.S. dollar,2000,"RHS, 1995/96",Fiscal year end: June 30; reporting period for national accounts data: FY.,1954,missing,missing,VAP,missing,missing,missing,missing,General,missing,missing,missing,Yes,1997/2002,missing,missing,missing,PR,PR,Puerto Rico,Puerto Rico
10,SLE,157,Sierra Leone,3796,Republic of Sierra Leone,Low income,Sub-Saharan Africa,IDA,HIPC,Sierra Leonean leone,2004,DHS 2008,Fiscal year end: June 30; reporting period for national accounts data: CY.,1990,missing,1993,VAB,missing,2005,BPM5,Preliminary,Special,Budgetary,GDDS,"IHS, 2003",missing,1984-1985,missing,2002,2000,SL,SL,Sierra Leone,Sierra Leone


In [148]:
size(june_fiscal)

(13, 34)

The answer is 13: the filtered data frame has 13 rows, so 13 countries have a fiscal year that ends in June. What a roller coaster! Patience and a clear head are needed.
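
The two filtering steps could probably be folded into one by treating missing notes as empty strings. This is a sketch on a made-up vector (`notes` is a hypothetical stand-in for the "Special Notes" column), not the exact code used above:

```julia
# Hypothetical notes column with missing entries, like "Special Notes"
notes = Union{Missing, String}[
    "Fiscal year end: June 30; reporting period for national accounts data: FY.",
    missing,
    "Fiscal year end: March 20; reporting period for national accounts data: FY.",
    "Fiscal year end: June 30; reporting period for national accounts data: CY.",
]

# replace missing with an empty string so occursin always returns a Bool
is_june = occursin.("Fiscal year end: June", coalesce.(notes, ""))

# number of entries whose fiscal year ends in June
n_june = count(is_june)
```

The same `coalesce` trick should work inside a row-wise `filter` on the joined data frame, which avoids the intermediate data frame of non-missing notes.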

## 5. 
You can use the quantmod (http://www.quantmod.com/) package to get historical stock prices for publicly traded companies on the NASDAQ and NYSE. Use the following code to download data on Amazon's stock price and get the times the data was sampled.