# Foundations of Computer Science - Project 2022/23

- Mattia Birti [897092] 
- Alberto Porrini [826306]
- Gloria Longo [864579]

You have to work on the Dogs adoptions dataset:
*  [Dogs](https://github.com/mat1218B/ProjectOfComputerScience/blob/main/adoptions/dogs.csv)
*  [Dog Travel](https://github.com/mat1218B/ProjectOfComputerScience/blob/main/adoptions/dogTravel.csv)
*  [NST-EST2021-POP](https://github.com/mat1218B/ProjectOfComputerScience/blob/main/adoptions/NST-EST2021-POP.csv)

### Notes

1. It is mandatory to use GitHub for developing the project.
1. The project must be a jupyter notebook.
1. There is no restriction on the libraries that can be used, nor on the Python version.
1. All questions on the project must be asked in a public channel on Zulip.
1. At most 3 students can be in each group. You must create the groups by yourself.
1. You do not have to send me the project before the discussion.

## Libraries

In [1]:
import pandas as pd
from datetime import datetime
from difflib import SequenceMatcher
import pycountry #must be installed first (pip install pycountry)

### Import data from GitHub

#### Dogs.csv

In [2]:
dogs = pd.read_csv("https://raw.githubusercontent.com/mat1218B/ProjectOfComputerScience/main/adoptions/dogs.csv", sep=",", encoding='latin-1')

In [3]:
dogs.shape

(58180, 37)

In [4]:
dogs.head()

Unnamed: 0,id,org_id,url,type.x,species,breed_primary,breed_secondary,breed_mixed,breed_unknown,color_primary,...,contact_city,contact_state,contact_zip,contact_country,stateQ,accessed,type.y,description,stay_duration,stay_cost
0,46042150,NV163,https://www.petfinder.com/dog/harley-46042150/...,Dog,Dog,American Staffordshire Terrier,Mixed Breed,True,False,White / Cream,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,Harley is not sure how he wound up at shelter ...,70,124.81
1,46042002,NV163,https://www.petfinder.com/dog/biggie-46042002/...,Dog,Dog,Pit Bull Terrier,Mixed Breed,True,False,Brown / Chocolate,...,Las Vegas,NV,89147,US,89009,2019-09-20,Dog,6 year old Biggie has lost his home and really...,49,122.07
2,46040898,NV99,https://www.petfinder.com/dog/ziggy-46040898/n...,Dog,Dog,Shepherd,,False,False,Brindle,...,Mesquite,NV,89027,US,89009,2019-09-20,Dog,Approx 2 years old.\n Did I catch your eye? I ...,87,281.51
3,46039877,NV202,https://www.petfinder.com/dog/gypsy-46039877/n...,Dog,Dog,German Shepherd Dog,,False,False,,...,Pahrump,NV,89048,US,89009,2019-09-20,Dog,,62,145.83
4,46039306,NV184,https://www.petfinder.com/dog/theo-46039306/nv...,Dog,Dog,Dachshund,,False,False,,...,Henderson,NV,89052,US,89009,2019-09-20,Dog,Theo is a friendly dachshund mix who gets alon...,93,241.09


#### DogTravel.csv

In [5]:
dogTravel = pd.read_csv("https://raw.githubusercontent.com/mat1218B/ProjectOfComputerScience/main/adoptions/dogTravel.csv", sep=",", encoding='latin-1')

In [6]:
dogTravel.shape

(6194, 9)

In [7]:
dogTravel.head()

Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there
0,0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made hi...,Arkansas,,,
1,1,44698509,Groveland,FL,Duke is an almost 2 year old Potcake from Abac...,Abacos,Bahamas,,
2,2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star lookin...,Adam,Maryland,,
3,3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from an...,Adaptil,,True,
4,4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her peo...,Afghanistan,,,


#### NST-EST.csv

In [8]:
nstest = pd.read_csv("https://raw.githubusercontent.com/mat1218B/ProjectOfComputerScience/main/adoptions/NST-EST2021-POP.csv", header=None, sep=",", encoding='latin-1')

In [9]:
nstest.shape

(51, 2)

In [10]:
nstest.head()

Unnamed: 0,0,1
0,Alabama,5.024.279
1,Alaska,733.391
2,Arizona,7.151.502
3,Arkansas,3.011.524
4,California,39.538.223


## 1. Extract all dogs with status that is not adoptable

Explore the dog's column names

In [11]:
dogs.columns

Index(['id', 'org_id', 'url', 'type.x', 'species', 'breed_primary',
       'breed_secondary', 'breed_mixed', 'breed_unknown', 'color_primary',
       'color_secondary', 'color_tertiary', 'age', 'sex', 'size', 'coat',
       'fixed', 'house_trained', 'declawed', 'special_needs', 'shots_current',
       'env_children', 'env_dogs', 'env_cats', 'name', 'status', 'posted',
       'contact_city', 'contact_state', 'contact_zip', 'contact_country',
       'stateQ', 'accessed', 'type.y', 'description', 'stay_duration',
       'stay_cost'],
      dtype='object')

Explore the first row

In [12]:
dogs.iloc[0,:]

id                                                          46042150
org_id                                                         NV163
url                https://www.petfinder.com/dog/harley-46042150/...
type.x                                                           Dog
species                                                          Dog
breed_primary                         American Staffordshire Terrier
breed_secondary                                          Mixed Breed
breed_mixed                                                     True
breed_unknown                                                  False
color_primary                                          White / Cream
color_secondary                          Yellow / Tan / Blond / Fawn
color_tertiary                                                   NaN
age                                                           Senior
sex                                                             Male
size                              

We can see that two variables (posted, accessed) contain dates. Let's check whether they are stored in the right format

In [13]:
type(dogs.loc[0, 'posted'])
type(dogs.loc[0, 'accessed'])

str

Since they are in string format, we convert them to date format

In [14]:
#for i in range(len(dogs)):
        #dogs.loc[i, 'posted']=datetime.strptime(dogs.loc[i,'posted'], "%Y-%m-%dT%H:%M:%S+0000")
        #dogs.loc[i,'accessed']=datetime.strptime(dogs.loc[i,'accessed'], "%Y-%m-%d")

We can see that some rows do not contain a valid date in 'posted'. Let's see which ones

In [15]:
#for i in range(len(dogs)):
#    if(len(dogs.loc[i,'posted'])!=24):
#        print(dogs.iloc[i,])
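
As a minimal sketch of the same check without a loop, assume a small toy frame `sample` (hypothetical, not the real data) whose `posted` column mixes valid timestamps with misplaced text. `pd.to_datetime` with `errors='coerce'` turns every string that does not match the format into `NaT`, so the problematic rows are those where the parsed value is missing.

```python
import pandas as pd

# Hypothetical toy data: the second row mimics a record whose columns are shifted
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "posted": ["2019-09-20T16:37:59+0000", "adoptable", "2019-09-20T06:42:30+0000"],
})

# Strings that do not match the expected format become NaT instead of raising
parsed = pd.to_datetime(
    sample["posted"], format="%Y-%m-%dT%H:%M:%S+0000", errors="coerce"
)

# Rows whose 'posted' value could not be parsed
bad_rows = sample[parsed.isna()]
print(bad_rows)
```

On the real table the same mask could be used to select only the rows that need repairing.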

The problem with these rows is that the 'name' cell was not split properly, so the values that follow are shifted by one column. Let's fix these rows and convert the dates to datetime format

In [16]:
for i in range(len(dogs)):
    if(len(dogs.loc[i,'posted'])==24):
        dogs.loc[i, 'posted']=datetime.strptime(dogs.loc[i,'posted'], "%Y-%m-%dT%H:%M:%S+0000")
        dogs.loc[i,'accessed']=datetime.strptime(dogs.loc[i,'accessed'], "%Y-%m-%d")
    else :
        j=26
        l=(dogs.iloc[i,24]).split(' ',1)
        prov=dogs.iloc[i,25]
        dogs.iloc[i,25]=l[1]
        dogs.iloc[i,24]=l[0]
        while(j<33):
            prov2=dogs.iloc[i,j]
            dogs.iloc[i,j]=prov
            prov=prov2
            j=j+1
        dogs.loc[i, 'posted']=datetime.strptime(dogs.loc[i,'posted'], "%Y-%m-%dT%H:%M:%S+0000")
        dogs.loc[i,'accessed']=datetime.strptime(dogs.loc[i,'accessed'], "%Y-%m-%d")

Now we can solve this task using a 'while' loop which scrolls through all the rows of the dataframe and stores only the identification code of the dogs that have a status other than adoptable

In [17]:
lista=[]
i=0
while(i<len(dogs)):
    if (dogs.loc[i,'status']!='adoptable'):
        lista.append(dogs.loc[i,'id'])
    i=i+1
lista

[41330726,
 38169117,
 45833989,
 45515547,
 45294115,
 45229004,
 45227052,
 45569380,
 44694387,
 36978896,
 33218331,
 42092005,
 39594038,
 45895274,
 45964719,
 44538917,
 41430442,
 45907639,
 45362806,
 32590894,
 31426754,
 46037827,
 44044071,
 27521132,
 38473806,
 34101432,
 45958435,
 45927580,
 45916348,
 45733027,
 45413997,
 45406516,
 45264615]
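
The same selection can probably be written with a boolean mask instead of an explicit loop. The sketch below uses a hypothetical three-row frame `sample` rather than the real `dogs` table, and the name `not_adoptable_ids` is only illustrative.

```python
import pandas as pd

# Hypothetical toy data standing in for the dogs table
sample = pd.DataFrame({
    "id": [10, 11, 12],
    "status": ["adoptable", "adopted", "adoptable"],
})

# Keep the ids of the rows whose status differs from 'adoptable'
not_adoptable_ids = sample.loc[sample["status"] != "adoptable", "id"].tolist()
print(not_adoptable_ids)
```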

## 2. For each (primary) breed, determine the number of dogs

In [18]:
dogs.groupby('breed_primary').count()['id']

breed_primary
Affenpinscher                         17
Afghan Hound                           4
Airedale Terrier                      19
Akbash                                 3
Akita                                181
                                    ... 
Wirehaired Pointing Griffon            1
Wirehaired Terrier                    60
Xoloitzcuintli / Mexican Hairless     11
Yellow Labrador Retriever            158
Yorkshire Terrier                    360
Name: id, Length: 216, dtype: int64

## 3. For each (primary) breed, determine the ratio between the number of dogs of Mixed Breed and those not of Mixed Breed. Hint: look at the secondary_breed.

Let's count the dogs for each primary breed with a groupby:

In [19]:
nBreed = dogs.groupby('breed_primary').count()[['species','breed_secondary']]
nBreed

Unnamed: 0_level_0,species,breed_secondary
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1
Affenpinscher,17,2
Afghan Hound,4,1
Airedale Terrier,19,9
Akbash,3,0
Akita,181,52
...,...,...
Wirehaired Pointing Griffon,1,0
Wirehaired Terrier,60,18
Xoloitzcuintli / Mexican Hairless,11,4
Yellow Labrador Retriever,158,62


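As we can see, the "species" count can be used as the total number of dogs for each primary breed, while the "breed_secondary" count only includes the dogs that have a secondary breed.
For example, there are 17 Affenpinschers in total, 2 of which are mixed breed.

Rename the columns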

In [20]:
nBreed = nBreed.rename( columns = {'species':'total', 'breed_secondary':'nSecondary'})
nBreed

Unnamed: 0_level_0,total,nSecondary
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1
Affenpinscher,17,2
Afghan Hound,4,1
Airedale Terrier,19,9
Akbash,3,0
Akita,181,52
...,...,...
Wirehaired Pointing Griffon,1,0
Wirehaired Terrier,60,18
Xoloitzcuintli / Mexican Hairless,11,4
Yellow Labrador Retriever,158,62


We have to compute how many dogs have no secondary breed:

In [21]:
nBreed['nPrimary'] = nBreed['total'] - nBreed['nSecondary']
nBreed

Unnamed: 0_level_0,total,nSecondary,nPrimary
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Affenpinscher,17,2,15
Afghan Hound,4,1,3
Airedale Terrier,19,9,10
Akbash,3,0,3
Akita,181,52,129
...,...,...,...
Wirehaired Pointing Griffon,1,0,1
Wirehaired Terrier,60,18,42
Xoloitzcuintli / Mexican Hairless,11,4,7
Yellow Labrador Retriever,158,62,96


It's time to compute the ratio between the dogs without a secondary breed (nPrimary) and the dogs with one (nSecondary).

In [22]:
nBreed['ratioBreed'] = nBreed['nPrimary'] / nBreed['nSecondary']
nBreed

Unnamed: 0_level_0,total,nSecondary,nPrimary,ratioBreed
breed_primary,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Affenpinscher,17,2,15,7.500000
Afghan Hound,4,1,3,3.000000
Airedale Terrier,19,9,10,1.111111
Akbash,3,0,3,inf
Akita,181,52,129,2.480769
...,...,...,...,...
Wirehaired Pointing Griffon,1,0,1,inf
Wirehaired Terrier,60,18,42,2.333333
Xoloitzcuintli / Mexican Hairless,11,4,7,1.750000
Yellow Labrador Retriever,158,62,96,1.548387


As we can see, there are many 'inf' values. This is expected: for those breeds no dog has a secondary breed, so the denominator is zero.
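
A compact sketch of the same computation on hypothetical toy data is shown here. It flags the rows that have a secondary breed, counts them per primary breed, and, as one possible design choice rather than the notebook's own, replaces the infinite ratios with `NaN` so they are easier to filter later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one Akita out of three has a secondary breed
sample = pd.DataFrame({
    "breed_primary": ["Akita", "Akita", "Akita", "Akbash"],
    "breed_secondary": ["Mixed Breed", None, None, None],
})

# True where a secondary breed is present
flags = sample.assign(has_secondary=sample["breed_secondary"].notna())

# Per breed: total rows and rows with a secondary breed
counts = flags.groupby("breed_primary")["has_secondary"].agg(["size", "sum"])
counts.columns = ["total", "nSecondary"]
counts["nSecondary"] = counts["nSecondary"].astype(int)
counts["nPrimary"] = counts["total"] - counts["nSecondary"]

# Division by zero gives inf; here it is turned into NaN (an assumption of this sketch)
counts["ratioBreed"] = (counts["nPrimary"] / counts["nSecondary"]).replace(np.inf, np.nan)
print(counts)
```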

## 4. For each (primary) breed, determine the earliest and the latest posted timestamp.

We use the group by method to solve this task starting from the latest posted timestamp.

In [23]:
df_new = dogs[dogs.groupby('breed_primary')['posted'].transform('max') == dogs['posted']]
df_new.loc[ :, ['breed_primary', 'posted']]

Unnamed: 0,breed_primary,posted
0,American Staffordshire Terrier,2019-09-20 16:37:59
6,Italian Greyhound,2019-09-20 06:42:30
749,American Bulldog,2019-09-20 16:06:25
1034,Pit Bull Terrier,2019-09-20 17:20:30
1035,Cocker Spaniel,2019-09-20 17:20:30
...,...,...
55810,Mountain Dog,2019-09-17 06:00:26
56127,Tosa Inu,2019-09-04 00:48:24
56696,Thai Ridgeback,2019-05-11 12:51:57
57459,Saint Bernard,2019-09-19 17:41:48


In [24]:
df_new = dogs[dogs.groupby('breed_primary')['posted'].transform('min') == dogs['posted']]
df_new.loc[ :, ['breed_primary', 'posted']]

Unnamed: 0,breed_primary,posted
693,Shiba Inu,2016-02-22 23:53:06
726,Siberian Husky,2011-01-14 21:58:52
728,Mastiff,2010-03-13 00:00:00
735,Boston Terrier,2009-03-23 00:00:00
738,Otterhound,2008-03-14 00:00:00
...,...,...
56902,American Hairless Terrier,2013-07-03 10:05:41
57449,Australian Shepherd,2008-11-06 00:00:00
57450,Xoloitzcuintli / Mexican Hairless,2007-02-01 00:00:00
58090,Presa Canario,2016-10-08 18:20:52
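
If only the two timestamps per breed are needed, rather than the full rows, a single aggregation may be enough. The sketch below assumes a hypothetical toy frame `sample` with an already converted `posted` column.

```python
import pandas as pd

# Hypothetical toy data with datetime values
sample = pd.DataFrame({
    "breed_primary": ["Akita", "Akita", "Mastiff"],
    "posted": pd.to_datetime(["2019-01-05", "2018-03-01", "2010-03-13"]),
})

# One row per breed, with the earliest ('min') and latest ('max') timestamp
posted_range = sample.groupby("breed_primary")["posted"].agg(["min", "max"])
print(posted_range)
```

This returns one row per breed, whereas the `transform` approach above keeps every row that ties for the minimum or maximum.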


## 5. For each state, compute the sex imbalance, that is the difference between male and female dogs. In which state this imbalance is largest?

We create a table with the count of male dogs for each state (contact_state)

In [25]:
dog_male=dogs[dogs['sex']=='Male']
dog_male=dog_male.groupby(['contact_state'])[['sex']].count()
dog_male.tail()

Unnamed: 0_level_0,sex
contact_state,Unnamed: 1_level_1
VT,276
WA,686
WI,277
WV,333
WY,31


We build the same table for the female dogs of each state

In [26]:
dog_female=dogs[dogs['sex']=='Female']
dog_female=dog_female.groupby(['contact_state'])[['sex']].count()
dog_female.tail()

Unnamed: 0_level_0,sex
contact_state,Unnamed: 1_level_1
VT,234
WA,598
WI,265
WV,232
WY,21


We create a new table called "dog_sex" by merging the male and female tables on contact_state

In [27]:
dog_sex=pd.merge(dog_male,dog_female,on="contact_state")
dog_sex=dog_sex.rename( columns = {'sex_x':'tot_male','sex_y':'tot_female'})
dog_sex.tail()

Unnamed: 0_level_0,tot_male,tot_female
contact_state,Unnamed: 1_level_1,Unnamed: 2_level_1
VT,276,234
WA,686,598
WI,277,265
WV,333,232
WY,31,21


We compute the imbalance as the difference between male and female dogs, then we take its absolute value, since we only consider the magnitude of the imbalance for each contact_state

In [28]:
dog_sex["imbalance"]=dog_sex["tot_male"]-dog_sex["tot_female"]
dog_sex["imbalance"]=abs(dog_sex["imbalance"])
dog_sex.tail()

Unnamed: 0_level_0,tot_male,tot_female,imbalance
contact_state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
VT,276,234,42
WA,686,598,88
WI,277,265,12
WV,333,232,101
WY,31,21,10


We find the state with the largest imbalance

In [29]:
dog_sex["imbalance"].nlargest(1)

contact_state
OH    205
Name: imbalance, dtype: int64
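
The steps above can also be sketched with a cross-tabulation, which builds the male and female counts in one call and fills missing combinations with 0. The frame `sample` is hypothetical toy data, not the real table.

```python
import pandas as pd

# Hypothetical toy data: VT has no male dogs at all
sample = pd.DataFrame({
    "contact_state": ["OH", "OH", "OH", "OH", "WY", "WY", "VT"],
    "sex": ["Male", "Male", "Male", "Female", "Female", "Male", "Female"],
})

# Rows are states, columns are sexes, cells are counts (0 when absent)
by_sex = pd.crosstab(sample["contact_state"], sample["sex"])

# Absolute difference between male and female counts
imbalance = (by_sex["Male"] - by_sex["Female"]).abs()

# State with the largest imbalance
largest = imbalance.idxmax()
print(largest, imbalance[largest])
```

Unlike an inner merge, this keeps states where only one sex is present.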

## 6. For each pair (age, size), determine the average duration of the stay and the average cost of stay.

Compute the mean of the stay duration and of the stay cost for each pair (age, size):

In [30]:
AgeSize = dogs.groupby(['age', 'size'])[['stay_duration','stay_cost']].mean()
AgeSize

Unnamed: 0_level_0,Unnamed: 1_level_0,stay_duration,stay_cost
age,size,Unnamed: 2_level_1,Unnamed: 3_level_1
Adult,Extra Large,89.015414,232.591561
Adult,Large,89.531943,238.661141
Adult,Medium,89.421036,238.258977
Adult,Small,89.407479,238.974838
Baby,Extra Large,87.032967,237.180879
Baby,Large,89.701564,238.698827
Baby,Medium,89.577668,237.108131
Baby,Small,89.958291,239.08381
Senior,Extra Large,88.861111,235.232361
Senior,Large,88.984298,237.507364


## 7. Find the dogs involved in at least 3 travels. Also list the breed of those dogs.

Let's start by grouping the dataset by the variable id

In [31]:
count=dogTravel.groupby('id').count()

We reset the index

In [32]:
count.reset_index(inplace=True)

Now we compute a list of only the dogs that have made at least 3 trips

In [33]:
list_index=(count[count['index']>2])['id'].tolist()

Finally we build a dictionary having as key the identification code of each dog that has made at least 3 trips, and as value its primary breed.
To do this we use a for loop which compares each identification code in the starting dataset with those contained in the list created above



In [34]:
diz={}
for index in list_index:
    for i in range(len(dogs)):
        if (dogs.loc[i, 'id']==index):
            diz[dogs.loc[i, 'id']]= dogs.loc[i, 'breed_primary']
diz

{16657005: 'Pit Bull Terrier',
 20905974: 'Chow Chow',
 24894870: 'Hound',
 24894894: 'Hound',
 33218331: 'Alaskan Malamute',
 36978896: 'Alaskan Malamute',
 37108842: 'Pit Bull Terrier',
 37253070: 'Labrador Retriever',
 37848260: 'Labrador Retriever',
 38050885: 'Pit Bull Terrier',
 38495992: 'Pit Bull Terrier',
 38664932: 'Pit Bull Terrier',
 39608594: 'Pointer',
 40036107: 'Pit Bull Terrier',
 40103682: 'Rat Terrier',
 41144335: 'Chihuahua',
 41359772: 'German Shepherd Dog',
 42445043: 'Dutch Shepherd',
 42525486: 'Cairn Terrier',
 42778494: 'Jack Russell Terrier',
 42825396: 'Parson Russell Terrier',
 42835971: 'Pit Bull Terrier',
 43082511: 'Saluki',
 43379087: 'Labrador Retriever',
 43401994: 'Labrador Retriever',
 43529811: 'Mastiff',
 43576891: 'Labrador Retriever',
 43576898: 'Labrador Retriever',
 43618547: 'Alaskan Malamute',
 43679504: 'Pit Bull Terrier',
 43829075: 'American Bulldog',
 43956989: 'Mixed Breed',
 44070616: 'Alaskan Malamute',
 44077433: 'Labrador Retriever'
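
The nested loop above compares every id in the list with every row of the table. A possibly faster sketch, on hypothetical toy frames `travels` and `dogs_sample`, counts the trips with `value_counts` and selects the matching dogs with `isin`.

```python
import pandas as pd

# Hypothetical toy data: dog 1 has 3 trips, dog 2 has 2, dog 3 has 4
travels = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3, 3, 3, 3]})
dogs_sample = pd.DataFrame({
    "id": [1, 2, 3],
    "breed_primary": ["Hound", "Saluki", "Mastiff"],
})

# Number of travel rows per dog id
trips = travels["id"].value_counts()

# Ids with at least 3 trips
frequent_ids = trips[trips >= 3].index

# Map each selected id to its primary breed
breeds = (
    dogs_sample.loc[dogs_sample["id"].isin(frequent_ids)]
    .set_index("id")["breed_primary"]
    .to_dict()
)
print(breeds)
```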

## 8. Fix the travels table so that the correct state is computed from the manual and the found fields. If manual is not missing, then it overrides what is stored in found.

To answer the question, we create two different tables. The first table contains only the observations where the manual field is filled in.

In [35]:
manual_not_na=dogTravel[dogTravel['manual'].notna()]
manual_not_na

Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there
1,1,44698509,Groveland,FL,Duke is an almost 2 year old Potcake from Abac...,Abacos,Bahamas,,
2,2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star lookin...,Adam,Maryland,,
6,6,45287347,Wooster,OH,"Tate is an adorable 2 year old, 22 pound Cocka...",Akron,Ohio,,
7,7,45287347,Wooster,OH,"Tate is an adorable 2 year old, 22 pound Cocka...",Akron,Ohio,,
330,330,45276595,Guntersville,AL,We call this little guy Bono. He is a happy a...,Albertville,Alabama,,True
...,...,...,...,...,...,...,...,...,...
6183,6183,45017651,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,West Virginia,,True
6184,6184,44659739,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,West Virginia,,True
6185,6185,44289536,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,West Virginia,,True
6187,6187,42117845,Fairmont,WV,This is Dachshund Chihuahua Blue who weighs 7l...,WV,West Virginia,,True


In this table we create a new column ("correct_state") with the same values as manual (because these are the values that override the found column)

In [36]:
manual_not_na["correct_state"] = manual_not_na["manual"]
manual_not_na

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  manual_not_na["correct_state"] = manual_not_na["manual"]


Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there,correct_state
1,1,44698509,Groveland,FL,Duke is an almost 2 year old Potcake from Abac...,Abacos,Bahamas,,,Bahamas
2,2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star lookin...,Adam,Maryland,,,Maryland
6,6,45287347,Wooster,OH,"Tate is an adorable 2 year old, 22 pound Cocka...",Akron,Ohio,,,Ohio
7,7,45287347,Wooster,OH,"Tate is an adorable 2 year old, 22 pound Cocka...",Akron,Ohio,,,Ohio
330,330,45276595,Guntersville,AL,We call this little guy Bono. He is a happy a...,Albertville,Alabama,,True,Alabama
...,...,...,...,...,...,...,...,...,...,...
6183,6183,45017651,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,West Virginia,,True,West Virginia
6184,6184,44659739,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,West Virginia,,True,West Virginia
6185,6185,44289536,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,West Virginia,,True,West Virginia
6187,6187,42117845,Fairmont,WV,This is Dachshund Chihuahua Blue who weighs 7l...,WV,West Virginia,,True,West Virginia


In the second table, we keep only the observations with an empty manual value

In [37]:
manual_na=dogTravel[dogTravel['manual'].isna()]
manual_na

Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there
0,0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made hi...,Arkansas,,,
3,3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from an...,Adaptil,,True,
4,4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her peo...,Afghanistan,,,
5,5,43082511,Manchester,CT,Brooke has an unusual past. She was rescued f...,Afghanistan,,,
8,8,45987719,Locust Fork,AL,Meet Trixie... she is a female 2yr. Old Chihua...,Alabama,,,
...,...,...,...,...,...,...,...,...,...
6188,6188,41298157,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,,True,
6189,6189,40492179,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,,True,
6190,6190,45799729,Eagle Mountain,UT,Shiny is an approximately 4-6-year-old spayed ...,Wyoming,,,
6191,6191,34276515,Newnan,GA,Yanni is a Male Great Pyrenees that we rescue...,Yazmin,,True,


In this table we create a new column ("correct_state") with the same values as found (because these values are not overridden)

In [38]:
manual_na["correct_state"] = manual_na["found"]
manual_na

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  manual_na["correct_state"] = manual_na["found"]


Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there,correct_state
0,0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made hi...,Arkansas,,,,Arkansas
3,3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from an...,Adaptil,,True,,Adaptil
4,4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her peo...,Afghanistan,,,,Afghanistan
5,5,43082511,Manchester,CT,Brooke has an unusual past. She was rescued f...,Afghanistan,,,,Afghanistan
8,8,45987719,Locust Fork,AL,Meet Trixie... she is a female 2yr. Old Chihua...,Alabama,,,,Alabama
...,...,...,...,...,...,...,...,...,...,...
6188,6188,41298157,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,,True,,WV
6189,6189,40492179,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,,True,,WV
6190,6190,45799729,Eagle Mountain,UT,Shiny is an approximately 4-6-year-old spayed ...,Wyoming,,,,Wyoming
6191,6191,34276515,Newnan,GA,Yanni is a Male Great Pyrenees that we rescue...,Yazmin,,True,,Yazmin


Subsequently we build a new dataframe by concatenating the two tables created before, and we sort it by index

In [39]:
df=pd.concat([manual_na,manual_not_na])
df=df.sort_index()
df

Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there,correct_state
0,0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made hi...,Arkansas,,,,Arkansas
1,1,44698509,Groveland,FL,Duke is an almost 2 year old Potcake from Abac...,Abacos,Bahamas,,,Bahamas
2,2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star lookin...,Adam,Maryland,,,Maryland
3,3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from an...,Adaptil,,True,,Adaptil
4,4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her peo...,Afghanistan,,,,Afghanistan
...,...,...,...,...,...,...,...,...,...,...
6189,6189,40492179,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,,True,,WV
6190,6190,45799729,Eagle Mountain,UT,Shiny is an approximately 4-6-year-old spayed ...,Wyoming,,,,Wyoming
6191,6191,34276515,Newnan,GA,Yanni is a Male Great Pyrenees that we rescue...,Yazmin,,True,,Yazmin
6192,6192,44519341,Dayton,OH,Callie is a 14 year old Chihuahua whose owner ...,Young,Ohio,,,Ohio


Now the correct_state column is complete, with both the overridden and the non-overridden values, so we write those values into the found column of the dogTravel dataframe, which completes the task

In [40]:
dogTravel["found"]=df["correct_state"]
dogTravel

Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there
0,0,44520267,Anoka,MN,Boris is a handsome mini schnauzer who made hi...,Arkansas,,,
1,1,44698509,Groveland,FL,Duke is an almost 2 year old Potcake from Abac...,Bahamas,Bahamas,,
2,2,45983838,Adamstown,MD,Zac Woof-ron is a heartthrob movie star lookin...,Maryland,Maryland,,
3,3,44475904,Saint Cloud,MN,~~Came in to the shelter as a transfer from an...,Adaptil,,True,
4,4,43877389,Pueblo,CO,Palang is such a sweetheart. She loves her peo...,Afghanistan,,,
...,...,...,...,...,...,...,...,...,...
6189,6189,40492179,Fairmont,WV,Please contact Pet (information@pethelpersinc....,WV,,True,
6190,6190,45799729,Eagle Mountain,UT,Shiny is an approximately 4-6-year-old spayed ...,Wyoming,,,
6191,6191,34276515,Newnan,GA,Yanni is a Male Great Pyrenees that we rescue...,Yazmin,,True,
6192,6192,44519341,Dayton,OH,Callie is a 14 year old Chihuahua whose owner ...,Ohio,Ohio,,


In the last result, the found column contains the manual value whenever it is not "NaN", and the original found value otherwise
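
The split, copy and concatenate steps can probably be collapsed into a single `fillna` call, which takes `manual` where it is present and falls back to `found` otherwise. The sketch uses a hypothetical three-row frame `travel_sample`, and it also avoids the SettingWithCopyWarning shown above because no slice is modified.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'manual' is missing in the first row only
travel_sample = pd.DataFrame({
    "found": ["Arkansas", "Abacos", "Adam"],
    "manual": [np.nan, "Bahamas", "Maryland"],
})

# 'manual' overrides 'found' whenever it is not missing
travel_sample["correct_state"] = travel_sample["manual"].fillna(travel_sample["found"])
print(travel_sample)
```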

## 9. For each state, compute the ratio between the number of travels and the population.

### First Dataset Analysis

In [41]:
dogTravel.shape #total number of travels

(6194, 9)

In [42]:
dogTravel.groupby(['contact_state']).count().shape 
#number of states that have made trips

(45, 8)

In [43]:
nstest.shape #number of states in the population csv

(51, 2)

51 states in the population table, 45 distinct contact_state values that have made trips, 6194 trips in total

In [44]:
#change label
nTravelForState = dogTravel.groupby(['contact_state']).count()[['index']].reset_index().rename(columns={'contact_state':'state', 'index':'nTravel'})
nTravelForState.head()

Unnamed: 0,state,nTravel
0,17325,10
1,AL,75
2,AR,10
3,AZ,70
4,CA,28


In [45]:
nTravelForState['nTravel'].sum() #sum is correct

6194

In [46]:
nTravelForState.shape

(45, 2)

### There is an error

In [47]:
dogTravel[dogTravel['contact_state'] == '17325']

Unnamed: 0,index,id,contact_city,contact_state,description,found,manual,remove,still_there
3237,3237,36978896,PA,17325,Maddie is our little Miss Cutie Patootie! She ...,New York,,True,
3238,3238,33218331,PA,17325,"Born in August 2014, Bucky has a great sense o...",New York,,True,
3714,3714,36978896,PA,17325,Maddie is our little Miss Cutie Patootie! She ...,Pennsylvania,,True,
3715,3715,33218331,PA,17325,"Born in August 2014, Bucky has a great sense o...",Pennsylvania,,True,
6029,6029,36978896,PA,17325,Maddie is our little Miss Cutie Patootie! She ...,Virginia,,True,
6030,6030,33218331,PA,17325,"Born in August 2014, Bucky has a great sense o...",Virginia,,True,
6074,6074,36978896,PA,17325,Maddie is our little Miss Cutie Patootie! She ...,Washington DC,,True,
6075,6075,33218331,PA,17325,"Born in August 2014, Bucky has a great sense o...",Washington DC,,True,
6133,6133,36978896,PA,17325,Maddie is our little Miss Cutie Patootie! She ...,West Virginia,,True,
6134,6134,33218331,PA,17325,"Born in August 2014, Bucky has a great sense o...",West Virginia,,True,


The number refers to the state of Pennsylvania (the contact_city column contains PA). It needs to be corrected

In [48]:
nTravelForState[nTravelForState['state']=='PA']

Unnamed: 0,state,nTravel
34,PA,316


In [49]:
#take the 2 values
PAerr = nTravelForState[nTravelForState['state'] == '17325']['nTravel']
PAcorr = nTravelForState[nTravelForState['state'] == 'PA']['nTravel']

In [50]:
#delete the wrong row
nTravelForState = nTravelForState.drop(nTravelForState.index[nTravelForState['state'] == '17325']).reset_index()

In [51]:
nTravelForState.head()

Unnamed: 0,index,state,nTravel
0,1,AL,75
1,2,AR,10
2,3,AZ,70
3,4,CA,28
4,5,CO,103


In [52]:
#Add the right value: travels already under PA plus those recorded under 17325
nTravelForState.loc[nTravelForState['state']=='PA', 'nTravel'] = PAcorr.iloc[0] + PAerr.iloc[0]
nTravelForState[nTravelForState['state'] == 'PA']

Unnamed: 0,index,state,nTravel
33,34,PA,326
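
An alternative to patching the aggregated table is to fix the state code before counting. The sketch below assumes a hypothetical toy column and a small mapping `fixes`; further wrong codes could be added to the same dictionary.

```python
import pandas as pd

# Hypothetical toy data: one row carries '17325' instead of 'PA'
travel_sample = pd.DataFrame({"contact_state": ["PA", "17325", "PA", "NJ"]})

# Mapping from wrong codes to the intended state code
fixes = {"17325": "PA"}

# Replace the wrong codes first, then count the travels per state
n_travel = travel_sample["contact_state"].replace(fixes).value_counts()
print(n_travel)
```

With this order of operations no row has to be dropped and summed back afterwards.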


#### The states that have made trips become 44

### Table State and Population

In [53]:
nstest[1] = nstest[1].str.replace('.', '', regex=False)


In [54]:
state = nstest.rename(columns = {0:'state', 1:'nPopulation'})
state.head()

Unnamed: 0,state,nPopulation
0,Alabama,5024279
1,Alaska,733391
2,Arizona,7151502
3,Arkansas,3011524
4,California,39538223


#### Add state code

In [55]:
def findCountryAlpha2 (country_name):
    #look up the subdivision by name and keep the last two characters of its code
    try:
        sub = pycountry.subdivisions.lookup(country_name)
        stateCode = sub.code[-2:]
        return stateCode
    except LookupError:
        return ("not found!")

state['state_code'] = state.apply(lambda row: findCountryAlpha2(row.state) , axis = 1)
state.head()

Unnamed: 0,state,nPopulation,state_code
0,Alabama,5024279,AL
1,Alaska,733391,AK
2,Arizona,7151502,AZ
3,Arkansas,3011524,AR
4,California,39538223,CA


### Still errors

#### MT

In [56]:
#The pycountry lookup returns a wrong code for Montana (it does not look like a US state code)
state[state['state']=='Montana']

Unnamed: 0,state,nPopulation,state_code
26,Montana,1084225,12


In [57]:
state.loc[state['state']=='Montana', 'state_code'] = 'MT'
state[state['state']=='Montana']

Unnamed: 0,state,nPopulation,state_code
26,Montana,1084225,MT


#### MY

In [58]:
#the pycountry lookup also returns a wrong code for Maryland
state[state['state']=='Maryland']

Unnamed: 0,state,nPopulation,state_code
20,Maryland,6177224,MY


In [59]:
state.loc[state['state']=='Maryland', 'state_code'] = 'MD'
state[state['state']=='Maryland']

Unnamed: 0,state,nPopulation,state_code
20,Maryland,6177224,MD


#### CSV nstest

In [60]:
#there are some rows with state code NB, which we treat as the same state as NJ
nTravelForState[nTravelForState['state']=='NB']

Unnamed: 0,index,state,nTravel
23,24,NB,2


In [61]:
nTravelForState[nTravelForState['state']=='NJ']

Unnamed: 0,index,state,nTravel
26,27,NJ,552


In [62]:
NJerr = nTravelForState[nTravelForState['state'] == 'NB']['nTravel']
NJcorr = nTravelForState[nTravelForState['state'] == 'NJ']['nTravel']

In [63]:
#delete the wrong row
nTravelForState = nTravelForState.drop(nTravelForState.index[nTravelForState['state'] == 'NB']).reset_index()

In [64]:
#store the value in NJ: travels already under NJ plus those recorded under NB
nTravelForState.loc[nTravelForState['state']=='NJ', 'nTravel'] = NJcorr.iloc[0] + NJerr.iloc[0]
nTravelForState[nTravelForState['state'] == 'NJ']

Unnamed: 0,level_0,index,state,nTravel
25,26,27,NJ,554


#### Now the number of states that have made trips is 43

### Merging time

In [65]:
table_travel_for_state = pd.merge(state, nTravelForState, left_on='state_code', right_on='state')[['state_x', 'nPopulation', 'nTravel']]
table_travel_for_state.head()

Unnamed: 0,state_x,nPopulation,nTravel
0,Alabama,5024279,75
1,Arizona,7151502,70
2,Arkansas,3011524,10
3,California,39538223,28
4,Colorado,5773714,103


In [66]:
#check if sum is correct
table_travel_for_state['nTravel'].sum()

6194

### Add ratio between the number of travels and the population

In [67]:
table_travel_for_state['nPopulation'] = table_travel_for_state['nPopulation'].apply(pd.to_numeric)

In [68]:
table_travel_for_state['ratio_Travel_Population'] = table_travel_for_state['nPopulation'] / table_travel_for_state['nTravel']
table_travel_for_state.head()

Unnamed: 0,state_x,nPopulation,nTravel,ratio_Travel_Population
0,Alabama,5024279,75,66990.39
1,Arizona,7151502,70,102164.3
2,Arkansas,3011524,10,301152.4
3,California,39538223,28,1412079.0
4,Colorado,5773714,103,56055.48
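
Note that the cell above divides the population by the number of travels, so the column holds residents per travel. A sketch of the reciprocal, travels relative to population, is given below; it reuses two rows of the table above, and the column name `travels_per_100k` and the scaling to 100,000 residents are assumptions made only for readability.

```python
import pandas as pd

# Two rows taken from the merged table above
sample = pd.DataFrame({
    "state": ["Alabama", "Arkansas"],
    "nPopulation": [5024279, 3011524],
    "nTravel": [75, 10],
})

# Travels divided by population, expressed per 100,000 residents
sample["travels_per_100k"] = sample["nTravel"] / sample["nPopulation"] * 100_000
print(sample)
```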


## 10. For each dog, compute the number of days from the posted day to the day of last access.

To solve this task the difference between the column 'posted' and the column 'accessed' will be calculated.
The information obtained will be stored in a dictionary having the dog's identification code as the key and 
the days between the posted day and the last access as value.



In [69]:
diz={}
for i in range(len(dogs)):
    diz[dogs.loc[i,'id']]= -1*(dogs.loc[i, 'posted'] - dogs.loc[i, 'accessed']).days
diz

{46042150: 0,
 46042002: 0,
 46040898: 0,
 46039877: 0,
 46039306: 0,
 46039304: 0,
 46039303: 0,
 46039302: 0,
 46039301: 0,
 46038709: 0,
 46038708: 0,
 46038703: 0,
 46038700: 0,
 46038243: 0,
 46038070: 0,
 46038064: 0,
 46038065: 0,
 46038067: 0,
 46038068: 0,
 46038060: 0,
 46038062: 0,
 46038063: 0,
 46038061: 0,
 46037951: 0,
 46037918: 0,
 46037881: 0,
 46037860: 0,
 46037820: 0,
 46037762: 0,
 46037742: 0,
 46037637: 0,
 46037534: 0,
 46036459: 1,
 46035351: 1,
 46035350: 1,
 46035353: 1,
 46035346: 1,
 46035344: 1,
 46035342: 1,
 46034532: 1,
 46033962: 1,
 46032651: 1,
 46032592: 1,
 46032594: 1,
 46032595: 1,
 46032596: 1,
 46032588: 1,
 46032587: 1,
 46032589: 1,
 46032253: 1,
 46031946: 1,
 46031507: 1,
 46031797: 1,
 46031796: 1,
 46029444: 1,
 46029446: 1,
 46028152: 1,
 46027977: 1,
 46027945: 1,
 46027921: 1,
 46027872: 1,
 46027804: 1,
 46027303: 1,
 46026629: 1,
 46026616: 1,
 46026600: 1,
 46026454: 1,
 46026507: 1,
 46026395: 1,
 46026306: 1,
 46026195: 1,
 46026
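
The row-by-row loop above can also be written as a single column operation. The following is only a sketch on a toy frame with made-up ids and dates, assuming both columns are already of datetime type and carry no time-of-day component; `toy` and `diz_toy` are hypothetical names.

```python
import pandas as pd

# Toy frame with hypothetical values; the notebook itself works on `dogs`.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "posted": pd.to_datetime(["2019-09-20", "2019-09-19", "2019-09-01"]),
    "accessed": pd.to_datetime(["2019-09-20", "2019-09-20", "2019-09-20"]),
})

# Subtracting two datetime columns yields timedeltas; .dt.days keeps whole days.
days = (toy["accessed"] - toy["posted"]).dt.days

# Same shape as the dictionary above: id -> days.
diz_toy = dict(zip(toy["id"].tolist(), days.tolist()))
print(diz_toy)  # {1: 0, 2: 1, 3: 19}
```

When the timestamps include a time of day, `accessed - posted` and the sign-flipped `posted - accessed` can differ by one day, because `.days` rounds towards negative infinity.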

## 11. Partition the dogs according to the number of weeks from the posted day to the day of last access.

We create a new column "week" in the dogs table, initialised to 0; it is filled with the actual values in the next step.

In [70]:
dogs["week"]=0 
dogs["week"]

0        0
1        0
2        0
3        0
4        0
        ..
58175    0
58176    0
58177    0
58178    0
58179    0
Name: week, Length: 58180, dtype: int64

In this part we compute the number of weeks as the difference in days between the accessed date and the posted date, divided by seven (since every week has seven days).

In [71]:
for i in range(len(dogs)):
    # days between posting and last access, converted to weeks
    dogs.loc[i,'week']= ((dogs.loc[i, 'accessed'] - dogs.loc[i, 'posted']).days)/7
# round to the nearest whole week
dogs["week"]=round(dogs["week"],0)
dogs["week"]


0        -0.0
1        -0.0
2        -0.0
3        -0.0
4        -0.0
         ... 
58175    20.0
58176    23.0
58177    51.0
58178    53.0
58179    54.0
Name: week, Length: 58180, dtype: float64

Finally, we partition the dogs according to the number of weeks with the groupby method.

In [72]:
dogs.groupby("week").count()['id']

week
-0.0      7988
 1.0      7053
 2.0      4789
 3.0      4950
 4.0      2649
          ... 
 730.0       1
 747.0       1
 812.0       1
 813.0       1
 853.0       1
Name: id, Length: 580, dtype: int64
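
The count above gives the size of each group. To obtain the partition itself, i.e. which dogs fall into which week, one option is sketched below on a toy frame with made-up ids. Note that this sketch uses floor division (days 0 to 6 are week 0) instead of the rounding used above, and that `toy` and `partition` are hypothetical names.

```python
import pandas as pd

# Toy data: hypothetical ids with the number of days between posting and last access.
toy = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "days": [0, 6, 7, 15],
})

# Floor division assigns days 0-6 to week 0, days 7-13 to week 1, and so on.
toy["week"] = toy["days"] // 7

# Map each week to the list of ids that belong to it.
partition = toy.groupby("week")["id"].apply(list).to_dict()
print(partition)
```

With rounding, a dog accessed 4 days after posting lands in week 1, while with floor division it stays in week 0; which convention fits better depends on how "number of weeks" is read.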

## 12. Find duplicates in the dogs dataset. Two records are duplicates if they have (1) same breeds and sex, and (2) they share at least 90% of the words in the description field. Extra points if you find and implement a more refined method for determining if two rows are duplicates.

### 1) for-loop version

In [73]:
# 1) build a table with only the columns of interest
duplicate_table = dogs.iloc[:, [0,13,5,34]]
duplicate_table

Unnamed: 0,id,sex,breed_primary,description
0,46042150,Male,American Staffordshire Terrier,Harley is not sure how he wound up at shelter ...
1,46042002,Male,Pit Bull Terrier,6 year old Biggie has lost his home and really...
2,46040898,Male,Shepherd,Approx 2 years old.\n Did I catch your eye? I ...
3,46039877,Female,German Shepherd Dog,
4,46039306,Male,Dachshund,Theo is a friendly dachshund mix who gets alon...
...,...,...,...,...
58175,44605893,Male,Border Collie,"Due to the small size of our volunteer base, w..."
58176,44457061,Female,Australian Shepherd,
58177,42865848,Female,Border Collie,"Due to the small size of our volunteer base, w..."
58178,42734734,Male,Boxer,


In [74]:
# Drop all rows whose description is NaN
duplicate_table = duplicate_table[duplicate_table['description'].notna()]

In [75]:
duplicate_female = duplicate_table[duplicate_table['sex'] == 'Female']
duplicate_male = duplicate_table[duplicate_table['sex'] == 'Male']

In [76]:
# sort the rows by breed_primary
duplicate_female = duplicate_female.sort_values(by = 'breed_primary').reset_index()
duplicate_male = duplicate_male.sort_values(by = 'breed_primary').reset_index()

In [77]:
def scannerDuplicate(tab):
    # tab must be sorted by breed_primary, so that rows of the same breed are contiguous
    duplicate = {}
    for i in range(len(tab)): # every row as first element of a pair
        breed1 = tab.iloc[i]['breed_primary']
        desc1 = tab.iloc[i]['description']
        identi1 = tab.iloc[i]['id']

        for j in range(i+1,len(tab)): # the following rows with the same breed
            breed2 = tab.iloc[j]['breed_primary']
            desc2 = tab.iloc[j]['description']
            identi2 = tab.iloc[j]['id']

            if breed1 == breed2: # same breed -> compare the descriptions
                # ratio() measures the character-level similarity of the two strings
                if SequenceMatcher(None, desc1, desc2).ratio() >= 0.9:
                    duplicate[identi1]= 'duplicate'
                    duplicate[identi2]= 'duplicate'
            else: # different breed: stop the inner loop and move on to the next i
                break

    return duplicate

In [78]:
# trial run on the first 300 rows of the female table
prova_female = duplicate_female.iloc[0:300]
scannerDuplicate(prova_female)

{45970614: 'duplicate',
 45871731: 'duplicate',
 46023964: 'duplicate',
 46023963: 'duplicate',
 42087185: 'duplicate',
 43170920: 'duplicate',
 45986905: 'duplicate',
 45986935: 'duplicate',
 44518435: 'duplicate',
 44517341: 'duplicate',
 44130324: 'duplicate',
 45932298: 'duplicate',
 45931729: 'duplicate',
 46028451: 'duplicate',
 46028452: 'duplicate',
 46028453: 'duplicate',
 45560548: 'duplicate',
 45570899: 'duplicate',
 45949697: 'duplicate',
 46021044: 'duplicate',
 45252099: 'duplicate',
 45255573: 'duplicate',
 45872595: 'duplicate',
 45897563: 'duplicate',
 45897570: 'duplicate'}
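
`SequenceMatcher` compares the two descriptions character by character. Since the task speaks of shared words, a word-level check is another possible reading. The sketch below is one such interpretation, not the method used above: `shared_word_fraction` is a hypothetical helper, and measuring the overlap against the larger of the two word sets is an assumption.

```python
import re

def shared_word_fraction(desc1: str, desc2: str) -> float:
    """Share of distinct words in common, relative to the larger word set."""
    # Lowercase and keep runs of letters only, so punctuation does not matter.
    words1 = set(re.findall(r"[a-z]+", desc1.lower()))
    words2 = set(re.findall(r"[a-z]+", desc2.lower()))
    if not words1 or not words2:
        return 0.0
    return len(words1 & words2) / max(len(words1), len(words2))

# Made-up descriptions for illustration.
a = "Harley is a friendly dog who loves long walks"
b = "Harley is a friendly dog who loves long walks!"
c = "Biggie has lost his home"

print(shared_word_fraction(a, b) >= 0.9)  # True
print(shared_word_fraction(a, c) >= 0.9)  # False
```

Dividing by the larger set means a short description that happens to be contained in a much longer one is not flagged as a duplicate.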

### 2) Text Mining version

In [79]:
df1=dogs
df1 = df1.dropna(subset=['breed_secondary'])
df1 = df1.dropna(subset=['description'])
df1.index = df1.reset_index(drop=True).index 

# import the required libraries
import nltk
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
simili=[]
coppia=[]

# sort the dataset alphabetically by breed_primary
# (the result is not assigned, so df1 itself stays unsorted; the loop below does not rely on the order)
df1.sort_values('breed_primary')
# store the distinct values of breed_primary in a set
l=set(df1.loc[:, 'breed_primary'])

for el in l:
            corpus=[]
            # build a sub-dataset with only the rows that share this breed_primary value
            subset=df1[df1['breed_primary']==el]
            # reset the index of the sub-dataset to 0..len(subset)-1
            subset.index = subset.reset_index(drop=True).index 
            if (len(subset)!=0):
                for idx in subset.index:
                    # keep letters only: every other character becomes a space
                    title = re.sub('[^a-zA-Z]', ' ', subset.loc[idx, 'description'])
                    # convert everything to lowercase
                    title = title.lower()
                    # tokenization
                    title = title.split()

                    #title = [ps.stem(word) for word in title if not word in stopwords.words('english')]
                    title = ' '.join(title)
                    corpus.append(title)
                    #Stopwords Removal
                    vect = TfidfVectorizer(min_df=1, stop_words="english")   
                    #Feature weighting
                    tfidf = vect.fit_transform(corpus)                                                                                                                                                                                                                       
                    pairwise_similarity = tfidf * tfidf.T
                    finale=[]
                    #Matrix of similarity to list (become a list of lists)
                    l=pairwise_similarity.A.tolist()
                for lista in l:
                    for el in lista:
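                        # the upper bound skips the diagonal, where the self-similarity 1 sometimes comes out as 0.99;
                        # it also leaves out any other pair with similarity >= 0.99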
                        if 0.90<=el<0.99:
                            tupla=(l.index(lista), lista.index(el))
                            finale.append(tupla)
                for tuple in finale:
                    for i in range(len(tuple)):
                        if (subset.loc[tuple[0], 'breed_secondary']==subset.loc[tuple[1], 'breed_secondary'] and subset.loc[tuple[0], 'sex']==subset.loc[tuple[1], 'sex']):
                                t=(subset.loc[tuple[0], 'id'], subset.loc[tuple[1], 'id'])
                                #use the id since the index is different between the matrix and the dataset df1
                                simili.append(t)
print(simili)

[(46040486, 46040487), (46040486, 46040487), (46040487, 46040486), (46040487, 46040486), (45300758, 45300755), (45300758, 45300755), (45300755, 45300758), (45300755, 45300758), (45182452, 45175036), (45182452, 45175036), (45182452, 45175036), (45182452, 45175036), (45182449, 45174934), (45182449, 45174934), (45182449, 45174934), (45182449, 45174934), (45947304, 45947300), (45947304, 45947300), (45947300, 45947304), (45947300, 45947304), (44174656, 44174649), (44174656, 44174649), (44174649, 44174656), (44174649, 44174656), (46004914, 46004225), (46004914, 46004225), (46004225, 46004914), (46004225, 46004914), (45831312, 45831443), (45831312, 45831443), (45831443, 45831312), (45831443, 45831312), (45795183, 45795175), (45795183, 45795175), (45569788, 45569541), (45569788, 45569541), (45569541, 45569788), (45569541, 45569788), (45521864, 45521701), (45521864, 45521701), (45521701, 45521864), (45521701, 45521864), (45175036, 45182452), (45175036, 45182452), (45174934, 45182449), (45174934

In [80]:
print(len(set(simili))) # distinct ordered pairs: (a, b) and (b, a) are still counted separately

806
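
The count above is taken over ordered tuples, so a pair that appears both as (a, b) and as (b, a) in the printed list is counted twice. A minimal sketch of how each pair could be counted once, using a few tuples copied from the output above (`simili_toy` and `unordered_pairs` are hypothetical names):

```python
# A few pairs copied from the printed output above.
simili_toy = [
    (46040486, 46040487), (46040486, 46040487),
    (46040487, 46040486), (45300758, 45300755),
    (45300755, 45300758),
]

# Sorting each pair makes (a, b) and (b, a) collapse to the same tuple.
unordered_pairs = {tuple(sorted(pair)) for pair in simili_toy}

print(len(set(simili_toy)))   # 4 distinct ordered pairs
print(len(unordered_pairs))   # 2 distinct unordered pairs
```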
