# Overview

## This project works with semi-structured data and nested dictionaries: it loads a JSON file into Python, performs processing and extraction tasks plus some initial EDA, and writes selected data out to another format (CSV).

### Details below:



In [9]:
### Goal:

#   - Explore length and data type
#   - Clean missing values by checking membership of keys in each record using the clean_missing function
#   - Count horse records by birth year using a frequency dictionary
#   - Write out the entire contents of the horses JSON data file to a CSV,
#     with missing values filled in with 'missing' and the fields holding Wikidata IDs changed from the URL to just the ID.
#
#  Headers are:
#    horse, horseLabel, mother, father, birthyear, genderLabel

In [10]:
import json                                                                
import csv                                                                 
                                                                           
def get_id_from_url(url_string):                                           
    '''pass this a string with a wikidata url                              
       it will return just the id from that url'''                         
    if url_string == 'missing':                                            
        wikidataid= 'missing'                                              
    else:                                                                  
        parts = url_string.split('/')                                      
        wikidataid = parts[-1]                                             
    return wikidataid                                                      
                                                                           
def clean_missing(key, record):                                            
    '''function will check if a provided key is inside a record            
    and return either that value or the string value of missing.'''        
    if key in record:                                                      
        result = record[key]                                               
    else:                                                                  
        result = 'missing'                                                 
    return result                                                          
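
A quick check of how the two helpers compose may be useful here. The sketch below restates them in compact form so it runs on its own, and applies them to a hypothetical record: the Makybe Diva sample entry from this notebook with its "father" key removed for illustration.

```python
# Helpers restated from the cell above (same behavior, shorter form)
# so that this snippet is self-contained.
def get_id_from_url(url_string):
    if url_string == 'missing':
        return 'missing'
    return url_string.split('/')[-1]

def clean_missing(key, record):
    return record[key] if key in record else 'missing'

# Hypothetical record: the sample entry with "father" deliberately dropped.
record = {
    "horse": "http://www.wikidata.org/entity/Q1001792",
    "horseLabel": "Makybe Diva",
    "mother": "http://www.wikidata.org/entity/Q14949904",
    "birthyear": "1999",
    "genderLabel": "female organism",
}

# Clean first, then strip the URL: a missing key passes through as 'missing'.
horse_id = get_id_from_url(clean_missing("horse", record))
father_id = get_id_from_url(clean_missing("father", record))

print(horse_id)   # Q1001792
print(father_id)  # missing
```

Calling clean_missing before get_id_from_url matters: the URL helper only handles the 'missing' placeholder, not an absent key.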

In [3]:
# here's an example of a single horse entry for reference:     
                                                               
"""                                                            
{                                                              
	"horse": "http://www.wikidata.org/entity/Q1001792",        
	"horseLabel": "Makybe Diva",                               
	"mother": "http://www.wikidata.org/entity/Q14949904",      
	"father": "http://www.wikidata.org/entity/Q5263956",       
	"birthyear": "1999",                                       
	"genderLabel": "female organism"                           
}"""                                                           


In [11]:
###        Notes: 

####       Run select values - horse, mother, father - through the get_id_from_url function to get just the ID out.                                       
####       Run all values through the clean_missing function to check for missing values.                                                                         
####       Create outfile using csv writer                          
####       Write out the headers                                    
####       Loop over the data list                                  
####       Extract target info from records                         
####       Clean info                                           
####       Write out the processed data rows.                       
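
The steps in these notes could be sketched roughly as follows. This is one possible arrangement, not necessarily the notebook's own implementation: the function name `write_horses`, the constants `HEADERS` and `URL_FIELDS`, the second sample record, and the output filename mentioned in the comments are all assumptions made for illustration. The helpers are restated so the snippet runs on its own, and an in-memory buffer stands in for the output file.

```python
import csv
import io

HEADERS = ["horse", "horseLabel", "mother", "father", "birthyear", "genderLabel"]
URL_FIELDS = {"horse", "mother", "father"}  # fields holding Wikidata URLs

def get_id_from_url(url_string):
    if url_string == 'missing':
        return 'missing'
    return url_string.split('/')[-1]

def clean_missing(key, record):
    return record[key] if key in record else 'missing'

def write_horses(records, outfile):
    """Write cleaned horse records as CSV to an open text file object."""
    csvout = csv.writer(outfile)
    csvout.writerow(HEADERS)                      # write out the headers
    for record in records:                        # loop over the data list
        row = []
        for key in HEADERS:
            value = clean_missing(key, record)    # extract and clean
            if key in URL_FIELDS:
                value = get_id_from_url(value)    # keep just the ID
            row.append(value)
        csvout.writerow(row)                      # write the processed row

# Stand-in for the `data` list; the second record is made up.
sample = [
    {
        "horse": "http://www.wikidata.org/entity/Q1001792",
        "horseLabel": "Makybe Diva",
        "mother": "http://www.wikidata.org/entity/Q14949904",
        "father": "http://www.wikidata.org/entity/Q5263956",
        "birthyear": "1999",
        "genderLabel": "female organism",
    },
    {"horse": "http://www.wikidata.org/entity/Q0000000", "horseLabel": "Example Horse"},
]

buffer = io.StringIO()
write_horses(sample, buffer)
print(buffer.getvalue())

# With a real file (hypothetical filename), the call would look like:
# with open('horses_cleaned.csv', 'w', encoding='utf-8', newline='') as outfile:
#     write_horses(data, outfile)
```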

In [12]:
# EDA questions                                                                                                                  
                                                                                                                                 
# 1: How many horses were born in 1999?                                                                                          
# 2: What is the range of birth years for all the horses included in this dataset?                                               
# 3: What percentage of horses were missing father values in this dataset compared to the proportion missing mother values?      
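
One way these three questions might be answered in a single pass is sketched below. The function name `summarize` and the four sample records are made up for illustration, and the sketch assumes every non-missing birthyear is a numeric string (as in '1999'), so the printed numbers describe the toy data only, not the real dataset.

```python
def clean_missing(key, record):
    # Restated from the helper cell so this snippet runs on its own.
    return record[key] if key in record else 'missing'

def summarize(records):
    """Answer the three EDA questions for a list of horse records."""
    born_1999 = 0
    years = []
    missing_father = 0
    missing_mother = 0
    for record in records:
        year = clean_missing("birthyear", record)
        if year == "1999":
            born_1999 += 1
        if year != "missing":
            years.append(int(year))  # assumes numeric year strings
        if clean_missing("father", record) == "missing":
            missing_father += 1
        if clean_missing("mother", record) == "missing":
            missing_mother += 1
    total = len(records)
    return {
        "born_1999": born_1999,
        "year_range": (min(years), max(years)) if years else None,
        "pct_missing_father": 100 * missing_father / total if total else 0.0,
        "pct_missing_mother": 100 * missing_mother / total if total else 0.0,
    }

# Made-up records; in the notebook this would be the loaded `data` list.
sample = [
    {"birthyear": "1999", "mother": "m1", "father": "f1"},
    {"birthyear": "2005", "mother": "m2"},
    {"birthyear": "1684"},
    {"father": "f4", "mother": "m4"},
]

result = summarize(sample)
print(result)
```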
                                                                                                                                 
                                                                                                                                 

In [14]:
with open('source_data_horses_sample.json', 'r', encoding='utf-8') as infile:
    data = json.load(infile)
                                                                                            

In [15]:
 # length and data type of the data variable    
                                                
                                                
print("length:", len(data))                     
                                                
print("data type:", type(data))                 
                                                

length: 1000
data type: <class 'list'>


In [6]:
# Goal: count the number of horses born in each year and print the dictionary.
# Info: loop through the records, accessing the birth year.
# Then use a dictionary counting pattern to count the years.
# For horse records without a birth year, clean_missing checks membership and returns 'missing'.
                                                                                                     
year_count = {}                                                                                      
for record in data:                                                                                  
                                                                                                     
    year = clean_missing("birthyear", record)                                                        
    if year in year_count:                                                                           
        year_count[year] +=1                                                                         
                                                                                                     
    else:                                                                                            
        year_count[year] = 1                                                                         
print(year_count)                                                                                    

{'1999': 25, '1986': 18, '1992': 19, 'missing': 35, '1684': 1, '1975': 12, '1978': 10, '1960': 7, '1987': 19, '1955': 9, '1956': 8, '2003': 40, '1966': 3, '1994': 25, '2004': 32, '1971': 12, '1991': 28, '2001': 34, '1988': 12, '1998': 26, '1997': 30, '1989': 17, '2005': 42, '1996': 21, '1993': 28, '2006': 36, '1990': 17, '1950': 4, '1972': 7, '2009': 12, '2002': 42, '1920': 1, '1867': 1, '1748': 1, '1977': 10, '1981': 13, '1970': 9, '1879': 1, '1807': 1, '1890': 1, '1969': 16, '2000': 26, '1995': 32, '1985': 17, '1976': 4, '1964': 3, '1983': 14, '1959': 1, '1928': 2, '1935': 2, '1921': 1, '1979': 13, '2007': 25, '1984': 16, '2008': 14, '2010': 8, '1915': 1, '1937': 1, '1967': 5, '1929': 2, '1944': 2, '1948': 7, '1974': 6, '1947': 2, '1980': 10, '1787': 1, '1961': 3, '1962': 7, '1982': 10, '1932': 4, '1926': 2, '1957': 8, '1968': 4, '1954': 3, '1946': 5, '1931': 2, '1952': 3, '1901': 1, '1941': 1, '1973': 7, '1930': 2, '1909': 1, '1949': 4, '1953': 3, '1965': 5, '1933': 1, '1936': 2, '1
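
The counting loop above can also be expressed with `collections.Counter` from the standard library. This is an alternative sketch, not the approach the notebook uses: `dict.get` with a default plays the role of clean_missing here, and the four sample records are made up.

```python
from collections import Counter

# Made-up records standing in for the loaded `data` list.
sample = [
    {"birthyear": "1999"},
    {"birthyear": "1999"},
    {},                      # no birthyear key
    {"birthyear": "2003"},
]

# record.get(key, default) returns the default when the key is absent,
# which matches what clean_missing does for dictionaries.
year_count = Counter(record.get("birthyear", "missing") for record in sample)

print(dict(year_count))            # {'1999': 2, 'missing': 1, '2003': 1}
print(year_count.most_common(1))   # [('1999', 2)]
```

A Counter is a dict subclass, so it still works with `.items()` when writing rows to a CSV.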

In [16]:
# Goal: write the counted birth years dictionary to a CSV file
#       named 'yearcounts.csv'.
# Notes:
#       Create the outfile object and pass it to csv.writer().
#       Write out the headers: year, count.
#       Loop through the counted dictionary via .items().
#       Each row contains the birth year and its count.

# newline='' is what the csv module docs recommend for files passed to csv.writer;
# the with block closes the file so every row is flushed to disk.
with open('yearcounts.csv', 'w', encoding='utf-8', newline='') as outfile:
    csvout = csv.writer(outfile)
    csvout.writerow(['year', 'count'])
    for year, count in year_count.items():
        csvout.writerow([year, count])
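
A round-trip check is one way to confirm that the year counts were written correctly. The sketch below is self-contained: it writes a two-entry dictionary (values copied from the printed output above) to a temporary directory rather than the working directory, then reads it back with `csv.DictReader`. The temporary path and the variable `restored` are assumptions for illustration.

```python
import csv
import os
import tempfile

# Two entries copied from the printed year_count output.
year_count = {"1999": 25, "missing": 35}

# Write to a throwaway directory so the real yearcounts.csv is untouched.
path = os.path.join(tempfile.mkdtemp(), "yearcounts.csv")
with open(path, "w", encoding="utf-8", newline="") as outfile:
    csvout = csv.writer(outfile)
    csvout.writerow(["year", "count"])
    csvout.writerows(year_count.items())  # each (year, count) pair is one row

with open(path, "r", encoding="utf-8", newline="") as infile:
    rows = list(csv.DictReader(infile))   # header row becomes the dict keys

# The csv module reads every field back as a string, so counts need int().
restored = {row["year"]: int(row["count"]) for row in rows}
print(restored == year_count)  # True
```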
                                                                                     