# Data Analysis with Pandas

By Liliana Torres


## What is Pandas and what do we use it for? 

Pandas is a package that makes importing and analyzing data much easier. It builds on packages like NumPy and matplotlib to give you a single, convenient place to do most of your data analysis and visualization work.

![](images/QuotePandas.jpg)

### Exploring the Data 
We will be working with data from a business that offers pet services. The business has a set of questions that we would like to answer. The data comes in the following files:

 * people_person: this file has the details of each user on the site. It may contain pet owners, pet sitters, or people who have not made any transactions on the site.
 * pets_pet: this file contains each pet that a user has added to their profile. One owner may have more than one pet, but not vice versa.
 * services_service: users on the site may offer pet care services. This file stores a record for each service offered. Each user can offer more than one service, but no more than one of each type.
 * conversations_conversation: an owner can book a service provider by starting a conversation with them. This file stores each conversation started on the platform.
 * conversations_conversation_pets: since a booking may involve many pets and a pet may have many bookings, it is necessary to store this many-to-many relationship.
 * conversations_message: each conversation consists of a series of messages. A conversation may contain many messages, but not vice versa.
 * conversations_review: if a booking occurs, either participant can leave a review of the experience. This file has records of those reviews, each consisting of a brief statement and a star rating.
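
To make the many-to-many relationship concrete, here is a minimal sketch with toy data. The column names `conversation_id` and `pet_id` in the link table are assumptions, since the real `conversations_conversation_pets` file is not shown in this notebook, and the tiny frames below are stand-ins rather than the real data.

```python
import pandas as pd

# Toy stand-ins for the real files (values are made up for illustration).
conversations = pd.DataFrame({"id": [1, 2], "requester_id": [10, 11]})
pets = pd.DataFrame({
    "id": [100, 101, 102],
    "name": ["Jammie", "Lonnie", "Emely"],
    "owner_id": [10, 10, 11],
})
# Link table: one row per (conversation, pet) pair. Column names are assumed.
conversation_pets = pd.DataFrame({
    "conversation_id": [1, 1, 2],
    "pet_id": [100, 101, 102],
})

# Resolve the link table by joining it to both sides.
booked_pets = (
    conversation_pets
    .merge(conversations, left_on="conversation_id", right_on="id")
    .merge(pets, left_on="pet_id", right_on="id", suffixes=("_conversation", "_pet"))
)

# Number of distinct pets involved in each conversation.
pets_per_conversation = booked_pets.groupby("conversation_id")["pet_id"].nunique()
print(pets_per_conversation)
```

Joining through the link table keeps one row per pair, so counts per conversation or per pet can be taken with a plain `groupby`.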
 

### * Loading the data into dataframes

In [3]:
# importing the libraries
import pandas as pd
import numpy as np

In [4]:
# Read the CSV data
people_person = pd.read_csv('../DataAnalysisPandas/datasets/people_person.csv')
pets_pet = pd.read_csv('../DataAnalysisPandas/datasets/pets_pet.csv')
services_service = pd.read_csv('../DataAnalysisPandas/datasets/services_service.csv')
conversations_conversation = pd.read_csv('../DataAnalysisPandas/datasets/conversations_conversation.csv')
conversations_conversation_pets = pd.read_csv('../DataAnalysisPandas/datasets/conversations_conversation_pets.csv')
conversations_message = pd.read_csv('../DataAnalysisPandas/datasets/conversations_message.csv')
conversations_review = pd.read_csv('../DataAnalysisPandas/datasets/conversations_review.csv')
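
By default `read_csv` loads timestamp columns as plain strings. As a hedged sketch, the snippet below parses `date_joined` while loading, using a two-row in-memory sample instead of the real file (the sample text is a stand-in built from the rows shown in this notebook).

```python
import io
import pandas as pd

# In-memory stand-in for people_person.csv with only a few columns.
csv_text = (
    "id,first_name,date_joined\n"
    "1,Leanora,2016-08-02 14:59:15.095591\n"
    "2,Elroy,2016-08-02 18:15:30.105940\n"
)

people_sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date_joined"],  # convert this column to datetimes while loading
)
print(people_sample.dtypes)
```

With a datetime column, later date filters and `.dt` accessors work without an extra conversion step.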

### * We will begin by answering some business questions to get to know the data better

1) How many users have signed up?

In [5]:
# let's take a look at the people_person data
people_person.head()

Unnamed: 0,id,first_name,last_name,email,channel,date_joined,photo,fee,gender
0,1,Leanora,Allcock,leanora.allcock635@hotmail.com,,2016-08-02 14:59:15.095591,https://placekitten.com/242/269,0.0,f
1,2,Elroy,Blanding,elroy.blanding510@yahoo.com,,2016-08-02 18:15:30.105940,https://placekitten.com/373/320,0.0,m
2,3,Jeanice,Aleman,jeanice.aleman392@hotmail.com,,2016-08-02 16:11:09.542004,https://placekitten.com/238/264,0.0,f
3,4,Tamala,Polhamus,tamala.polhamus146@aol.com,,2016-08-02 18:02:40.389299,https://placekitten.com/220/223,0.0,f
4,5,Alethea,Gubler,alethea.gubler708@aol.com,,2016-08-02 14:31:53.163034,https://placekitten.com/284/339,0.0,f


In [6]:
count_users = people_person.id.count()
count_users

64393

In [7]:
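# you can also count the users per gender by grouping on one level of a MultiIndex
# (count(level=...) was removed in pandas 2.0, so group by the index level instead)
count_users1 = people_person.set_index(["id", "gender"]).groupby(level="gender").count()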
count_users1

gender,first_name,last_name,email,channel,date_joined,photo,fee
f,32075,32075,32075,25781,32075,32075,32075
m,32318,32318,32318,25874,32318,32318,32318
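
The lower numbers in the `channel` column above come from the fact that `count()` skips missing values. A small sketch with made-up rows shows the difference between `size()` (rows per group) and `count()` (non-null values per group).

```python
import numpy as np
import pandas as pd

# Made-up users; two of them have no channel recorded.
users = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["f", "m", "f", "m"],
    "channel": ["ads", np.nan, np.nan, "referral"],
})

rows_per_gender = users.groupby("gender").size()                   # counts every row
channels_per_gender = users.groupby("gender")["channel"].count()   # counts non-null values only

print(rows_per_gender)
print(channels_per_gender)
```

If the goal is simply the number of users per gender, `size()` avoids being affected by missing values in other columns.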


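2) How many users signed up prior to 2018-02-03?

In [8]:
# date_joined is stored as ISO-formatted text, which sorts chronologically, so a string comparison works here
users_prior = people_person[people_person['date_joined'] < '2018-02-03'].id.count()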
users_prior

35826
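
The same kind of filter can be written against real datetimes, which is more robust if the timestamp format ever changes. This is a sketch on three made-up rows, not the real data.

```python
import pandas as pd

# Made-up sign-up timestamps around the cutoff date.
signups = pd.DataFrame({
    "id": [1, 2, 3],
    "date_joined": [
        "2016-08-02 14:59:15.095591",
        "2018-02-02 23:59:59.000000",
        "2018-02-03 00:00:00.000000",
    ],
})

joined = pd.to_datetime(signups["date_joined"])   # convert text to datetimes
cutoff = pd.Timestamp("2018-02-03")

# A boolean mask sums to the number of True values.
users_before_cutoff = int((joined < cutoff).sum())
print(users_before_cutoff)
```

The comparison is strict, so a user who joined exactly at midnight on the cutoff date is not counted.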

3) What percentage of users have added pets?

Let's take a look at the data in the two CSV files.

In [9]:
pets_pet.head()

Unnamed: 0,id,name,description,gender,weight,birthday,plays_cats,plays_children,plays_dogs,spayed_neutered,house_trained,size,owner_id
0,1,Jammie,Morbi fames a mauris elit malesuada platea.,f,76,2016-05-26,1,1,1,1,1,large,12601
1,2,Lonnie,Class magna a libero felis sociosqu.,f,12,2014-05-20,0,1,1,1,0,small,12602
2,3,Emely,Felis class.,m,11,2014-08-21,0,1,1,1,0,small,12602
3,4,Emelia,Fames class egestas mollis risus posuere.,f,35,2013-09-23,1,1,1,0,0,medium,12603
4,5,Jami,Netus augue a congue orci.,m,35,2014-05-13,0,1,1,1,1,medium,12603


In [10]:
people_person.head()

Unnamed: 0,id,first_name,last_name,email,channel,date_joined,photo,fee,gender
0,1,Leanora,Allcock,leanora.allcock635@hotmail.com,,2016-08-02 14:59:15.095591,https://placekitten.com/242/269,0.0,f
1,2,Elroy,Blanding,elroy.blanding510@yahoo.com,,2016-08-02 18:15:30.105940,https://placekitten.com/373/320,0.0,m
2,3,Jeanice,Aleman,jeanice.aleman392@hotmail.com,,2016-08-02 16:11:09.542004,https://placekitten.com/238/264,0.0,f
3,4,Tamala,Polhamus,tamala.polhamus146@aol.com,,2016-08-02 18:02:40.389299,https://placekitten.com/220/223,0.0,f
4,5,Alethea,Gubler,alethea.gubler708@aol.com,,2016-08-02 14:31:53.163034,https://placekitten.com/284/339,0.0,f


In [11]:
total_added_pet=pets_pet.owner_id.nunique()
total_added_pet

51793

In [12]:
total_users=people_person.id.nunique()
total_users

64393

In [13]:
percentage_added = total_added_pet * 100 / total_users
percentage_added

80.4326557234482
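
An alternative way to reach this kind of percentage is to flag each user with `isin`, which only counts owners that actually exist in the users table. The frames below are toy stand-ins with made-up ids.

```python
import pandas as pd

# Toy stand-ins: five users, three of whom own at least one pet.
people = pd.DataFrame({"id": [1, 2, 3, 4, 5]})
pets = pd.DataFrame({"id": [10, 11, 12, 13], "owner_id": [1, 1, 3, 4]})

# True for every user whose id appears as an owner_id.
has_pet = people["id"].isin(pets["owner_id"])

# The mean of a boolean column is the fraction of True values.
percentage_with_pets = has_pet.mean() * 100
print(percentage_with_pets)
```

The boolean `has_pet` column can also be kept on the users table for later segmentation.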

4) Of those users, how many pets has each added on average?

In [14]:
# count every pet record, then divide by the number of users who have added at least one pet
total_added_pet_all = pets_pet.id.count()
total_added_pet_all

77730

In [15]:
added_average = total_added_pet_all / total_added_pet
added_average

1.500781958951982
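
The same average can be obtained by counting pets per owner and taking the mean of those counts. This sketch uses six made-up pets and assumes, as in the data above, that every pet has an `owner_id`.

```python
import pandas as pd

# Made-up pets: owner 12601 has one pet, 12602 has two, 12603 has three.
pets = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "owner_id": [12601, 12602, 12602, 12603, 12603, 12603],
})

pets_per_owner = pets.groupby("owner_id").size()   # number of pets for each owner
average_pets = pets_per_owner.mean()

# Equivalent to dividing all pets by the number of distinct owners.
ratio = len(pets) / pets["owner_id"].nunique()
print(average_pets, ratio)
```

Keeping `pets_per_owner` around also makes it easy to look at the distribution, for example with `pets_per_owner.value_counts()`.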

5) What percentage of pets play well with cats?

In [16]:
total_plays_cats = (pets_pet.plays_cats.sum() * 100 ) / pets_pet.plays_cats.count()
total_plays_cats

24.850122217933873
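
Because `plays_cats` is a 0/1 flag, its mean is already the fraction of pets that play well with cats. A short sketch on made-up values, assuming the column has no missing entries:

```python
import pandas as pd

# Made-up flags: two of four pets play well with cats.
pets = pd.DataFrame({"plays_cats": [1, 0, 0, 1]})

# sum / count, as in the cell above
pct_sum_count = pets["plays_cats"].sum() * 100 / pets["plays_cats"].count()

# mean of a 0/1 column gives the same fraction
pct_mean = pets["plays_cats"].mean() * 100
print(pct_sum_count, pct_mean)
```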

6) Booking ratings: what is the average, and how are the ratings distributed per star value?


In [22]:
# let's look at the data related to it
conversations_review.head()

Unnamed: 0,id,content,stars,conversation_id,reviewer_id
0,1,Netus proin per duis dolor venenatis nam.,1,7,64386
1,2,Dolor proin donec phasellus ve suspendisse ac ...,5,9,64384
2,3,Proin ipsum urna nisl egestas justo class a ar...,5,11,64382
3,4,Porta velit lectus varius donec tellus sollici...,1,13,64381
4,5,Dolor felis.,2,15,64379


In [23]:
conversations_conversation.head()

Unnamed: 0,id,start_date,end_date,units,added,booking_total,cancellation_fault,requester_id,service_id,booked_at,cancelled_at
0,1,2018-07-26,2018-07-31,5,2018-07-16 10:17:53.460035,120,,64393,4646,,
1,2,2018-08-10,2018-08-16,6,2018-08-01 10:20:48.626868,132,,64392,10126,,
2,3,2018-06-16,2018-06-19,3,2018-06-05 16:46:39.542467,168,,64391,20677,,
3,4,2018-07-13,2018-07-20,7,2018-07-02 09:12:22.275923,490,,64391,3847,,
4,5,2018-07-02,2018-07-07,5,2018-06-21 16:02:48.694725,140,,64389,9982,,


In [24]:
people_person.head()

Unnamed: 0,id,first_name,last_name,email,channel,date_joined,photo,fee,gender
0,1,Leanora,Allcock,leanora.allcock635@hotmail.com,,2016-08-02 14:59:15.095591,https://placekitten.com/242/269,0.0,f
1,2,Elroy,Blanding,elroy.blanding510@yahoo.com,,2016-08-02 18:15:30.105940,https://placekitten.com/373/320,0.0,m
2,3,Jeanice,Aleman,jeanice.aleman392@hotmail.com,,2016-08-02 16:11:09.542004,https://placekitten.com/238/264,0.0,f
3,4,Tamala,Polhamus,tamala.polhamus146@aol.com,,2016-08-02 18:02:40.389299,https://placekitten.com/220/223,0.0,f
4,5,Alethea,Gubler,alethea.gubler708@aol.com,,2016-08-02 14:31:53.163034,https://placekitten.com/284/339,0.0,f


In [26]:
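# attach the requester's profile to each conversation
mergedDf = conversations_conversation.merge(people_person, left_on='requester_id', right_on='id')
# attach reviews by requester (matched per user, not per conversation, so rows can repeat)
finalDf = mergedDf.merge(conversations_review, how="left", left_on='requester_id', right_on='reviewer_id')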
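
A join on the user id can multiply rows when one user has several conversations and several reviews. The sketch below contrasts that with a join on the conversation id, using toy frames. It assumes that `conversation_id` in the reviews table refers to `id` in the conversations table, which the column names suggest but the notebook does not confirm.

```python
import pandas as pd

# Toy data: user 7 started two conversations and reviewed both.
conversations = pd.DataFrame({"id": [1, 2, 3], "requester_id": [7, 7, 8]})
reviews = pd.DataFrame({
    "id": [1, 2],
    "stars": [5, 1],
    "conversation_id": [1, 2],
    "reviewer_id": [7, 7],
})

# Join per user: each of user 7's conversations pairs with each of their reviews.
by_user = conversations.merge(
    reviews, how="left", left_on="requester_id", right_on="reviewer_id",
    suffixes=("_conversation", "_review"),
)

# Join per conversation: each conversation only gets its own reviews.
# validate="one_to_many" raises if conversation ids are not unique on the left.
by_conversation = conversations.merge(
    reviews, how="left", left_on="id", right_on="conversation_id",
    suffixes=("_conversation", "_review"), validate="one_to_many",
)

print(len(by_user), len(by_conversation))
```

Since either participant can leave a review, a real conversation may still match more than one review row, so checking the row count after each merge is a useful habit.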

In [28]:
# keep only conversations that were booked and not cancelled
removes = finalDf[(pd.notnull(finalDf["booked_at"])) & (pd.isnull(finalDf["cancelled_at"]))]
#removes.shape

In [29]:
removes.groupby(['stars'])['reviewer_id'].count()

stars
1.0     2080
2.0     2097
3.0     2118
4.0     2108
5.0    27568
Name: reviewer_id, dtype: int64
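
To read the distribution as proportions, the counts above can be divided by their total. This sketch rebuilds the counts by hand from the output shown. On the raw column, `value_counts(normalize=True)` gives the same kind of result directly.

```python
import pandas as pd

# Counts copied from the groupby output above.
star_counts = pd.Series(
    {1.0: 2080, 2.0: 2097, 3.0: 2118, 4.0: 2108, 5.0: 27568},
    name="reviews",
)
star_counts.index.name = "stars"

# Share of each star value among all counted reviews.
star_share = star_counts / star_counts.sum()
print(star_share.round(3))
```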

7) What is the average star rating?

In [31]:
conversations_review["stars"].mean()

4.307377192675326

8) How many services can provide injected medication? Do we have enough of them to take care of the pets that need it?

In [31]:
# Let's take a look at the data in services
services_service.head()

Unnamed: 0,id,service_type,cancellation_policy,can_provide_oral_medication,can_provide_injected_medication,senior_dog_experience,special_needs_experience,takes_small_dogs,takes_medium_dogs,takes_large_dogs,takes_puppies,max_dogs,provider_id,fee,price,added
0,1,boarding,strict,1,1,1,1,0,1,1,1,4,1,0.15,35,2016-08-02 14:59:15.095591
1,2,dog-walking,strict,1,0,1,1,0,0,1,1,5,1,0.15,26,2016-08-02 14:59:15.095591
2,3,boarding,moderate,0,0,1,0,0,0,1,1,2,2,0.15,31,2016-08-02 18:15:30.105940
3,4,dog-walking,strict,1,0,1,0,1,0,0,1,5,2,0.15,27,2016-08-02 18:15:30.105940
4,5,day-care,strict,1,0,1,1,0,1,1,1,5,2,0.15,30,2016-08-02 18:15:30.105940


In [32]:
# let's pivot the data: one row per service type, one column per medication flag
pvt = services_service.pivot_table(index='service_type', columns='can_provide_injected_medication', values='id', aggfunc='count')
pvt


service_type,can_provide_injected_medication=0,can_provide_injected_medication=1
boarding,5623,1410
day-care,5709,1439
dog-walking,5717,1500


9) What is the booking total per service type?

In [58]:
services_service.head() #ids are here

Unnamed: 0,id,service_type,cancellation_policy,can_provide_oral_medication,can_provide_injected_medication,senior_dog_experience,special_needs_experience,takes_small_dogs,takes_medium_dogs,takes_large_dogs,takes_puppies,max_dogs,provider_id,fee,price,added
0,1,boarding,strict,1,1,1,1,0,1,1,1,4,1,0.15,35,2016-08-02 14:59:15.095591
1,2,dog-walking,strict,1,0,1,1,0,0,1,1,5,1,0.15,26,2016-08-02 14:59:15.095591
2,3,boarding,moderate,0,0,1,0,0,0,1,1,2,2,0.15,31,2016-08-02 18:15:30.105940
3,4,dog-walking,strict,1,0,1,0,1,0,0,1,5,2,0.15,27,2016-08-02 18:15:30.105940
4,5,day-care,strict,1,0,1,1,0,1,1,1,5,2,0.15,30,2016-08-02 18:15:30.105940


In [59]:
merge_services = removes.merge(services_service[["service_type", "id"]], left_on='service_id', right_on='id')
#removes.head()
merge_services.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41001 entries, 0 to 41000
Data columns (total 27 columns):
id_x                  41001 non-null int64
start_date            41001 non-null object
end_date              41001 non-null object
units                 41001 non-null int64
added                 41001 non-null datetime64[ns]
booking_total         41001 non-null int64
cancellation_fault    0 non-null object
requester_id          41001 non-null int64
service_id            41001 non-null int64
booked_at             41001 non-null object
cancelled_at          0 non-null object
id_y                  41001 non-null int64
first_name            41001 non-null object
last_name             41001 non-null object
email                 41001 non-null object
channel               33047 non-null object
date_joined           41001 non-null object
photo                 41001 non-null object
fee                   41001 non-null float64
gender                41001 non-null object
id_x            

In [60]:
# select the column before summing so that non-numeric columns are not aggregated
merge_services.groupby("service_type")["booking_total"].sum().sort_index(ascending=False)

service_type
dog-walking    1981831
day-care       2024484
boarding       3277752
Name: booking_total, dtype: int64
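
If both the number of bookings and the summed `booking_total` are of interest, named aggregation returns them side by side. This is a sketch on four made-up bookings. The output column names `n_bookings` and `total_amount` are chosen here for illustration.

```python
import pandas as pd

# Made-up bookings with their service type and booking total.
bookings = pd.DataFrame({
    "service_type": ["boarding", "boarding", "dog-walking", "day-care"],
    "booking_total": [120, 168, 132, 140],
})

summary = (
    bookings.groupby("service_type")
    .agg(
        n_bookings=("booking_total", "count"),   # how many bookings per type
        total_amount=("booking_total", "sum"),   # summed booking_total per type
    )
    .sort_values("total_amount", ascending=False)
)
print(summary)
```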