# Reproducing Notebook

Because of the nature of this project and the difficulty of reproducing it entirely in code, this notebook is predominantly a markdown document describing the steps I took to create it.

# First steps: Gathering the data

Before investigating any questions, I first needed to determine which datasets to use. While many data sources were considered (some were discarded only after the extraction process had begun), I eventually settled on the following sources and their respective websites. My documentation on GitHub notes exactly which days I first accessed these sites and when I began to pull data. Versions of each site, saved on the day I began extraction, are stored locally on my computer.

**My sources:**
* American Kennel Club (AKC), Breed Popularity Statistics: https://www.akc.org/expert-advice/news/most-popular-dog-breeds-full-ranking-list/
* Westminster Dog Show - Best in Show: https://fwkc-web-prod.corebine.com/en/best-in-show-winners
* The Puppybowl:
  * http://www.animalplanet.com/tv-shows/puppy-bowl/photos/puppy-bowl-xiii-starting-lineup/
  * http://www.animalplanet.com/tv-shows/puppy-bowl/photos/puppy-bowl-xii-starting-lineup/
  * http://www.animalplanet.com/tv-shows/puppy-bowl/photos/xi-starting-lineup/
  * http://www.animalplanet.com/tv-shows/puppy-bowl/photos/x-starting-lineup/
  * http://www.animalplanet.com/tv-shows/puppy-bowl/photos/ix-starting-lineup-pictures/
* KnowYourMeme: https://knowyourmeme.com/
* Wikipedia: https://www.wikipedia.org

**Pulling the data**

My data was gathered in several different ways. For Westminster and AKC, the data was available in tabular format and was copied and pasted into Excel for cleaning. For KnowYourMeme, I searched the database for "dog" and entered the results into an Excel file by hand. Any memes marked as "confirmed" by the site were noted, along with the year each emerged and the dog breed associated with it. For example, the "Doge" meme, featuring a Shiba Inu, first appeared in 2010 and gained popularity in 2013. The PuppyBowl data was pulled from the starting lineups of the past 5 years using an XPath expression: `//span[@class='more-information global-description']/span/text()`. The results were then copied and pasted into an Excel file for cleaning. Finally, the Wikipedia data was accessed using a script written by Elizabeth Wickes that produces JSON files. The resulting JSON files were then read in using Python and exported to Excel for cleaning.
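The following is a minimal sketch of how that XPath expression selects the lineup text. It assumes Python's standard-library `xml.etree.ElementTree` (the original extraction may well have used a browser tool or a different library), and the markup is a made-up stand-in for one lineup entry, not the real page. ElementTree supports only a subset of XPath and has no `text()` step, so the sketch selects the inner `span` elements and reads their `.text` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment standing in for one lineup entry;
# the real page markup is not reproduced here.
SAMPLE = """
<div>
  <span class="more-information global-description">
    <span>Name: Atticus</span>
    <span>Breed: Husky / Labrador</span>
  </span>
</div>
"""

def extract_descriptions(markup):
    """Return the text of each inner span under the description span."""
    root = ET.fromstring(markup)
    # ElementTree has no text() step, so select the elements and read .text.
    path = ".//span[@class='more-information global-description']/span"
    return [(node.text or "").strip() for node in root.findall(path)]

lines = extract_descriptions(SAMPLE)
print(lines)
```

Real web pages are rarely well-formed XML, so a live page would likely need an HTML-aware parser; the selection logic stays the same.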

# American Kennel Club

This data file required very minimal cleaning. It was uploaded into OpenRefine, where the data was text faceted, compared, and then hand-edited to check spelling and confirm that each data point was unique. The cleaned data was then exported to a new Excel file. In total, cleaning this dataset took approximately 10 to 15 minutes. The cleaned file can be found below. Only the data from 2017 was used in the final dataset.

In [1]:
import pandas as pd
akc = pd.read_excel("AKC_Popularity_Rankings.xlsx")
print(akc)

                              Breed  2017 Rank  2016 Rank  2015 Rank  \
0             Retrievers (Labrador)          1        1.0        1.0   
1              German Shepherd Dogs          2        2.0        2.0   
2               Retrievers (Golden)          3        3.0        3.0   
3                   French Bulldogs          4        6.0        6.0   
4                          Bulldogs          5        4.0        4.0   
5                           Beagles          6        5.0        5.0   
6                           Poodles          7        7.0        8.0   
7                       Rottweilers          8        8.0        9.0   
8                Yorkshire Terriers          9        9.0        7.0   
9     Pointers (German Shorthaired)         10       11.0       11.0   
10                           Boxers         11       10.0       10.0   
11                 Siberian Huskies         12       12.0       12.0   
12                       Dachshunds         13       13.0       

# Westminster Dog Show

Minimal cleaning was necessary for this dataset. Several rows that were marked as missing data were removed in Excel before the file was uploaded into OpenRefine. Once in OpenRefine, columns deemed unnecessary (name of dog, names of owners, and names of judges) were removed. This left just the breed data. Whitespace was removed. The breed column was then text faceted and evaluated to detect any spelling inconsistencies. Once cleaned, the data was exported to a new Excel file and sorted alphabetically in preparation for combining data. In total, the cleaning took approximately 5 to 10 minutes.

The data is stored in an Excel file. The cleaned file has a single data column of interest, the dog breed, and 112 rows (the printed output below also shows the year of each win). Each row reflects a winning dog: the column contains the breed that won the Westminster Dog Show for each year from 1907 until 2017 (the 2018 awards had not yet taken place when the data was extracted). The information in this column is textual and is being treated as categorical. There are missing values in this dataset; for example, 1923 has no breed recorded.
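The whitespace trimming and text faceting were done in OpenRefine, but the idea can be sketched in plain Python: normalize whitespace, then count each distinct value so that near-duplicates stand out. The function name and the sample values are hypothetical and are not taken from the actual file.

```python
from collections import Counter

def facet(values):
    """Trim and collapse whitespace, then count each distinct value."""
    cleaned = [" ".join(v.split()) for v in values if v and v.strip()]
    return Counter(cleaned)

# Hypothetical raw entries with stray whitespace and one misspelling.
raw_breeds = [
    "Wire Fox Terrier",
    " Wire Fox Terrier ",
    "Wire  Fox Terrier",
    "Wire Fox Terier",
    "Airedale Terrier",
]

counts = facet(raw_breeds)
for breed, n in sorted(counts.items()):
    print(n, breed)
```

A value that appears only once next to a similar, frequent value (here "Wire Fox Terier") is a likely spelling inconsistency to fix by hand.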

In [2]:
westminster = pd.read_excel("Cleaned_Westminster.xlsx")
print(westminster)

     YEAR                         BREED
0    1907            Smooth Fox Terrier
1    1908            Smooth Fox Terrier
2    1909            Smooth Fox Terrier
3    1910            Smooth Fox Terrier
4    1911              Scottish Terrier
5    1912              Airedale Terrier
6    1913                       Bulldog
7    1914          Old English Sheepdog
8    1915              Wire Fox Terrier
9    1916              Wire Fox Terrier
10   1917              Wire Fox Terrier
11   1918          Bull Terrier (White)
12   1919              Airedale Terrier
13   1920              Wire Fox Terrier
14   1921  Cocker Spaniel (Parti-Color)
15   1922              Airedale Terrier
16   1923                           NaN
17   1924              Sealyham Terrier
18   1925                       Pointer
19   1926              Wire Fox Terrier
20   1927              Sealyham Terrier
21   1928              Wire Fox Terrier
22   1929                Collie (Rough)
23   1930              Wire Fox Terrier


# The Puppybowl

This data source required quite a bit of cleaning, all of which was done in either Excel or OpenRefine, with each year having its own specific cleaning needs. Years 10 and 11 used colons as delimiters. Years 12 - 15 used slashes as delimiters (but also contained colons indicating name, breed, sex, and location). The first pass of restructuring the data, moving values out of punctuation-separated strings and into individual columns, occurred in Excel. This separated out most of what I needed, but extensive cleaning was still required. Each dataset was then uploaded individually into OpenRefine.

All files ended up with unnecessary columns that were removed, such as name, sex, fun fact, and age. For X, because of how the text was delimited, every value in "Breed" ended with the word "Sex". Using a simple Python script, I removed the word from the Breed column. Additionally, a few of the longer breed names had been cut off when the text was initially extracted from the website, so I manually added the final letters to complete the breed names where needed. XI ended up with "Breed: " before the name of each breed, which was likewise removed using Python within OpenRefine.

XII had the most cleaning involved, as the raw data included missing values. Within the raw data, the second column represented team name. However, not every dog had a team name entered, so some values from the third column, Breed, ended up populating the second column. Since this dataset was relatively small, the values were moved from one column to the next by hand. XIII and XIV were the easiest to clean, as the raw data separated cleanly once parsed by delimiter. However, these later years denoted mixed breeds using "-" instead of the earlier "/". I ran a simple Python script to replace all instances of "-" with "/" to create uniformity across all datasets.
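The three scripted fixes described above can be sketched as small Python functions. This is an illustration only: the original expressions were run inside OpenRefine and are not reproduced here, and the function names and sample values are hypothetical.

```python
import re

def strip_trailing_sex(breed):
    """Remove a stray trailing 'Sex' left over from splitting on delimiters."""
    return re.sub(r"\s*Sex\s*$", "", breed)

def strip_breed_prefix(breed):
    """Remove a leading 'Breed:' label."""
    return re.sub(r"^\s*Breed:\s*", "", breed)

def unify_mix_separator(breed):
    """Replace '-' with '/' so mixed breeds use one separator."""
    return breed.replace("-", "/")

# Hypothetical raw values illustrating each case.
print(strip_trailing_sex("Labrador Retriever Sex"))
print(strip_breed_prefix("Breed: Havanese"))
print(unify_mix_separator("Boxer-Hound Mix"))
```

Replacing every "-" would also alter any breed name that is legitimately hyphenated, so a text facet after this step is a sensible check.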

Each dataset was then text faceted individually to check spelling. Once cleaned, all five files were exported into new Excel files. The data from these files was then combined into a master file, which was once again uploaded into OpenRefine. I removed all whitespace from the master file and text faceted it again to ensure spelling was correct and uniform. Once exported back into Excel, the column was sorted alphabetically in preparation for being added to the combined file. In total, this cleaning process took approximately 4 hours.

**This first example is of an uncleaned dataset: XII**

In [3]:
xii_lineup = pd.read_excel("XII_Lineup.xlsx")
print(xii_lineup)

                 Name                                     Team  \
0      Name: Atticus                  Breed: Husky / Labrador    
1        Name: Bella                               Team: Ruff    
2       Name: Bijoux                              Team: Fluff    
3        Name: Boris                          Breed: Havanese    
4     Name: Brooklyn                   Breed: German Shepherd    
5      Name: Buttons                              Team: Fluff    
6     Name: Carolina    Breed: American Staffordshire Terrier    
7      Name: Charlie                              Team: Fluff    
8       Name: ChiChi                         Breed: Chihuahua    
9       Name: Clover                      Breed: Basset Hound    
10      Name: Cooper                              Team: Fluff    
11    Name: Countess                              Team: Fluff    
12       Name: Darby                               Team: Ruff    
13       Name: Dilly                               Team: Ruff    
14      Na

**The following is an example of the same dataset cleaned**

In [4]:
copy_xii_lineup = pd.read_excel("Copy of XII_Lineup.xlsx")
print(copy_xii_lineup)

          Name                           Breed
0      Atticus                Husky / Labrador
1        Bella                     Rat Terrier
2       Bijoux                      Great Dane
3        Boris                        Havanese
4     Brooklyn                 German Shepherd
5      Buttons                        Shih Tzu
6     Carolina  American Staffordshire Terrier
7      Charlie                 Feist / Whippet
8       ChiChi                       Chihuahua
9       Clover                    Basset Hound
10      Cooper         Great Pyrenees / Collie
11    Countess                 English Bulldog
12       Darby            Jack Russell Terrier
13       Dilly                  Terrier Poodle
14      Gordon                  English Setter
15     Gryffin            Old English Sheepdog
16      Hailey             Wire Haired Terrier
17        Hank         Yellow Labrador / Husky
18      Harper                        Pit Bull
19       Jimmy                     Spaniel mix
20       Kevi

**The following is the master file of combined Puppybowl Data**

In [5]:
master = pd.read_excel("Combined_Datafiles.xlsx")
print(master)

      Labrador Retriever/Terrier Mix
0                             Poodle
1        Labrador Retriever/Sato Mix
2                              Boxer
3                          Dalmatian
4    Bernese Mountain Dog/Poodle Mix
5                    Boxer/Hound Mix
6                 Labrador Retriever
7                    American Eskimo
8                         Poodle Mix
9                            Spaniel
10                      Papillon Mix
11                       Terrier Mix
12          Old English Sheepdog Mix
13                    Great Pyrenees
14             Shih Tzu/Pekingese Mi
15              Brittany Spaniel Mix
16                     Bassett Hound
17              Brittany Spaniel Mix
18                        Poodle Mix
19               Dachshund/Hound Mix
20                          Shih Tzu
21                          Pit Bull
22             Schnauzer/Terrier Mix
23      Labrador Retriever/Husky Mix
24                      Havenese Mix
25    Shih Tzu/Brussells Griffon Mix
2

# KnowYourMeme

This data required quite a bit of curation but very minimal cleaning. Even though this data was compiled straight from the website into Excel by hand, the dataset was still uploaded into OpenRefine to clean any whitespace, double-check spelling, and ensure uniformity in the date formats. In total, the cleaning aspect of this process took at most 5 minutes, as there were very few data points in this set.

In [6]:
meme = pd.read_excel("GenericDogMemes_Breeds_Year.xlsx")
print(meme)

                        Meme Name                Associated Breed  \
0                            Doge                       Shiba Inu   
1                    Doggo (term)                             NaN   
2              Sleep Tight Pupper                       Chihuahua   
3                         Pun Dog                Alaskan Klee Kai   
4                     Cupcake Dog              Australian Shepard   
5                        Cool Dog                       Shiba Inu   
6                    Birthday dog             bichon frise/poodle   
7                Yes, this is dog                  black Labrador   
8                    Broccoli Dog                       Shiba Inu   
9                    Dad joke dog                             NaN   
10                 Dumbstruck dog                           Boxer   
11  I have no idea what I'm doing                Golden Retriever   
12               Who's a good boy                             NaN   
13                    Copying dog 

# Beginning Combined Datasets

Using the AKC rankings from 2017 as the base, the previously mentioned datasets were combined. The number of times each breed appeared in each dataset was counted, and a combined dataset was created.

In [7]:
combined = pd.read_excel("CombinedData.xlsx")
print(combined)

        Breed (2017 Ranked Listing)  Breed Class  \
0             Retrievers (Labrador)          NaN   
1              German Shepherd Dogs          NaN   
2               Retrievers (Golden)          NaN   
3                   French Bulldogs          NaN   
4                          Bulldogs          NaN   
5                           Beagles          NaN   
6                           Poodles          NaN   
7                       Rottweilers          NaN   
8                Yorkshire Terriers          NaN   
9     Pointers (German Shorthaired)          NaN   
10                           Boxers          NaN   
11                 Siberian Huskies          NaN   
12                       Dachshunds          NaN   
13                      Great Danes          NaN   
14            Pembroke Welsh Corgis          NaN   
15               Doberman Pinschers          NaN   
16             Australian Shepherds          NaN   
17             Miniature Schnauzers          NaN   
18   Cavalie

This data was then filtered to keep the breeds with the most representation across the datasets, meaning those with data in two or more of the sources.

In [8]:
# Use a context manager so the file is closed after reading.
with open("Dogs_to_Search.txt", "r") as text:
    readtext = text.read()
print(readtext)

Labrador Retrievers
German Shepherd Dogs
Golden Retrievers
Bulldogs
Beagles
Poodles
Yorkshire Terriers
Boxers
Siberian Huskies
Pembroke Welsh Corgis * just search as corgi
Doberman Pinschers
Australian Shepherds
Shih Tzu
Pomeranians
English Springer Spaniels
Cocker Spaniels
Pugs
Chihuahuas
Shiba Inu
Bichons Frises
Papillons
Pekingese
Brussels Griffons
Wire Fox Terriers



Since I was now only working with a select number of dog breeds, I trimmed my combined dataset to match this list. The Breed Class column shown in the dataset below was added afterward using AKC records, based on a recommendation by Elizabeth Wickes. This did not come from a database but was manually compiled.

In [9]:
final = pd.read_excel("FilteredDog_Dataset_Combined.xlsx")
print(final)

                     Breed Name  Ranked Number (AKC)       Breed Class (AKC)  \
0          Retriever (Labrador)                    1                Sporting   
1          German Shepherd Dogs                    2        Herding/Guardian   
2           Retrievers (Golden)                    3                Sporting   
3                      Bulldogs                    5            Non-sporting   
4                       Beagles                    6                   Hound   
5                       Poodles                    7  Standard/Miniature/Toy   
6            Yorkshire Terriers                    9                     Toy   
7                        Boxers                   11                 Working   
8              Siberian Huskies                   12                 Working   
9         Pembroke Welsh Corgis                   15                 Herding   
10           Doberman Pinschers                   16                 Working   
11         Australian Shepherds         

# Wikipedia

This dataset will likely require some of the most time-intensive cleaning and curation. First, the alphabetized files from the other datasets were processed to count how many times each dog breed was mentioned. These counts were entered into a master file that lists the breeds (in ranked order) from AKC, with columns for PuppyBowl, Westminster, and Memes. Each row records how many times that breed was mentioned in each dataset. For example, the Boxer, ranked 11th in AKC popularity, occurred 4 times in the last 5 years of the PuppyBowl, 4 times in the Westminster Dog Show, and 1 time in KnowYourMeme. Breeds that had data in two or more categories were flagged. The flagged breeds were then run through code, received from Elizabeth Wickes, that queries the Wikipedia API. This code pulls the namespace, page id, page title, page size, word count, a snippet, and the timestamp of the last edit for each result and writes them to a combined JSON file. The 24 breed names produced 24 JSON files of various sizes (the file listing printed further down shows 25, with two files for Shih Tzu).
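The counting and flagging step was done by hand in Excel, but it can be sketched in Python as follows. The breed lists are hypothetical, shortened stand-ins for the cleaned files, and the sketch assumes breed names have already been normalized to the same spelling across sources (the AKC list uses plurals such as "Boxers", which would need to be reconciled first).

```python
from collections import Counter

# Hypothetical, shortened breed lists standing in for the cleaned files.
puppybowl = ["Boxer", "Boxer", "Poodle", "Shih Tzu"]
westminster = ["Boxer", "Wire Fox Terrier", "Wire Fox Terrier"]
memes = ["Shiba Inu", "Boxer"]

sources = {
    "PuppyBowl": Counter(puppybowl),
    "Westminster": Counter(westminster),
    "Memes": Counter(memes),
}

def build_table(ranked_breeds, sources):
    """One row per ranked breed: counts per source and a two-or-more flag."""
    table = []
    for breed in ranked_breeds:
        counts = {name: counter[breed] for name, counter in sources.items()}
        present = sum(1 for n in counts.values() if n > 0)
        table.append({"Breed": breed, **counts, "Flagged": present >= 2})
    return table

ranked = ["Poodle", "Boxer", "Shiba Inu"]  # hypothetical order, not real AKC ranks
table = build_table(ranked, sources)
for row in table:
    print(row)
```

In this toy data only the Boxer appears in two or more sources, so it is the only breed flagged.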

Wikipedia JSON extraction code was written by and belongs to Elizabeth Wickes (Github: elliewix). The Wikimedia Foundation owns the wikipedia.org domain being utilized for this project; however, it is unclear who owns individual Wikipedia pages and their respective contents.
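Elizabeth's extraction code is not reproduced here. As a rough illustration only, the fields it pulls (namespace, page id, title, size, word count, snippet, timestamp) match what the MediaWiki search API (`action=query` with `list=search`) returns, so a request of that general kind might look like the sketch below. The function names, the `srlimit` value, and the sample response are assumptions for illustration; the sketch builds the URL and parses a canned response rather than calling the network.

```python
import json
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"

def build_search_url(term, limit=50):
    """Build a MediaWiki full-text search request URL for one term."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)

def extract_records(response_text):
    """Pull the list of search hits out of a JSON response body."""
    return json.loads(response_text)["query"]["search"]

# Hypothetical response body with made-up values, shaped like a search result.
sample = json.dumps({"query": {"search": [
    {"ns": 0, "title": "Example page", "pageid": 1, "size": 100,
     "wordcount": 20, "snippet": "example", "timestamp": "2018-01-01T00:00:00Z"}
]}})

print(build_search_url("corgi"))
print(extract_records(sample)[0]["title"])
```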

I ran my filtered results through Elizabeth's extraction code, resulting in a number of JSON files. To get data from all files into a single .csv file, I used the following code:

In [10]:
import json
import csv
import glob

# Collect every breed-level JSON file in the working directory.
allfiles = glob.glob('*.json')

print(allfiles)

allrows = []
for path in allfiles:
    # Each file holds a list of search-result records for one breed.
    with open(path, "r", encoding="utf-8") as infile:
        data = json.load(infile)

    for record in data:
        # Keep the source file name so each row can be traced back to its breed.
        # Values stay as str: calling .encode('utf-8') in Python 3 would
        # write b'...' byte literals into the CSV.
        row = [
            path,
            record['title'],
            record['pageid'],
            record['wordcount'],
            record['snippet'],
        ]
        allrows.append(row)

headers = ['path', 'pagetitle', 'pageid', 'wordcount', 'snippet']

# Write UTF-8 explicitly so non-ASCII titles and snippets survive on any platform.
with open('allresults.csv', 'w', newline='', encoding='utf-8') as outfile:
    csvout = csv.writer(outfile)
    csvout.writerow(headers)
    csvout.writerows(allrows)

['aussieresults.json', 'beagleresults.json', 'bichonresults.json', 'boxerresults.json', 'brusselsresults.json', 'bulldogresults.json', 'chihuahuaresults.json', 'cockerspanielresults.json', 'corgiresults.json', 'dobermanresults.json', 'englishspringerresults.json', 'germanshepresults.json', 'goldenretresults.json', 'labresults.json', 'papillonresults.json', 'pekingeseresults.json', 'pomeranianresults.json', 'poodleresults.json', 'pugresults.json', 'shibaresults.json', 'shih2results.json', 'shihresults.json', 'sibhuskresults.json', 'wirefoxresults.json', 'yorkshireterrierresults.json']


The Python script above collects the file path of each breed JSON file and exports selected information (file name, page title, page id, word count, and snippet) to a new .csv file. This file contains upwards of 15,000 rows. As suggested by Elizabeth, a Keep column was added. As I go through each row, I mark it by hand with "t" for true or "f" for false. While some pages are obvious keeps (e.g., the page "Pembroke Welsh Corgi"), others are harder to judge and may require looking up the page on Wikipedia to determine its status. Once this is complete, the rows marked "t" will be filtered and exported to a new file, which will be my official Wikipedia dataset. So far, I have spent upwards of 8 hours on this cleaning. This is a first pass, meaning that the t/f markings are based on initial impressions of the information available in the .csv file.
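Once the hand-marking is finished, the final filtering step could look something like the sketch below. It assumes the marked file has a column named `keep` holding "t" or "f"; that column name, the function name, and the sample rows are hypothetical.

```python
import csv
import io

def keep_marked_rows(infile, outfile, keep_column="keep"):
    """Copy only rows whose keep column is 't' from infile to outfile."""
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        if (row[keep_column] or "").strip().lower() == "t":
            writer.writerow(row)
            kept += 1
    return kept

# Hypothetical rows; real use would pass files opened with newline="".
sample = io.StringIO(
    "path,pagetitle,pageid,wordcount,snippet,keep\n"
    "corgiresults.json,Pembroke Welsh Corgi,1,100,example,t\n"
    "corgiresults.json,Unrelated page,2,50,example,f\n"
)
result = io.StringIO()
kept = keep_marked_rows(sample, result)
print(kept)
print(result.getvalue())
```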

**I was having trouble uploading my most recent .csv here, so I loaded the original one without my t/f edits.**

In [12]:
editing_wiki = pd.read_csv("allresults.csv")
print(editing_wiki)

                               path  \
0                aussieresults.json   
1                aussieresults.json   
2                aussieresults.json   
3                aussieresults.json   
4                aussieresults.json   
5                aussieresults.json   
6                aussieresults.json   
7                aussieresults.json   
8                aussieresults.json   
9                aussieresults.json   
10               aussieresults.json   
11               aussieresults.json   
12               aussieresults.json   
13               aussieresults.json   
14               aussieresults.json   
15               aussieresults.json   
16               aussieresults.json   
17               aussieresults.json   
18               aussieresults.json   
19               aussieresults.json   
20               aussieresults.json   
21               aussieresults.json   
22               aussieresults.json   
23               aussieresults.json   
24               aussiere