# EDA and data cleaning

In this notebook, we will go over the data we have and perform EDA and data cleaning.

In [51]:
# imports
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns 
import re


## Dropping unnecessary columns
We will start our cleaning process by dropping the columns we are sure we don't need.

In [52]:
# read the preprocessed data
df = pd.read_csv('./../data/austin_listings_processed.csv')
print(f'the size of our data is {df.shape}')
df.head(2)

the size of our data is (47037, 81)




Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,occ_rate_calendar,active_duration_days,occ_rate_70,occ_rate_50,occ_rate_30,time_quarter
0,5456,https://www.airbnb.com/rooms/5456,20231215200307,2023-12-16,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,,My neighborhood is ideally located if you want...,https://a0.muscache.com/pictures/14084884/b5a3...,8028,...,1,0,0,3.71,0.3,5390.0,0.7,0.7,0.7,Q4
1,5769,https://www.airbnb.com/rooms/5769,20231215200307,2023-12-16,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,https://a0.muscache.com/pictures/23822033/ac94...,8186,...,0,1,0,1.76,0.7,5404.0,0.388601,0.544041,0.7,Q4


In [53]:
# print list of the columns
print(list(df.columns))


['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availabil

Let us retain the columns we might want to use later and drop only the columns that definitely won't be used.

In [54]:

columns_to_keep = ['id', 'source', 'name', 'description','neighborhood_overview',
                   'host_is_superhost', 'neighbourhood_cleansed', 'latitude',
                   'longitude', 'property_type', 'room_type', 'accommodates',
                   'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
                   'minimum_nights', 'maximum_nights', 'number_of_reviews',
                   'review_scores_rating','occ_rate_50', 'time_quarter']
df = df[columns_to_keep]
# rename columns if needed
df.rename(columns={'neighbourhood_cleansed': 'zipcode',
                   'occ_rate_50': 'occupancy_rate'}, inplace=True)
df.head(3)


Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,latitude,longitude,property_type,...,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter
0,5456,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,,My neighborhood is ideally located if you want...,t,78702,30.26057,-97.73441,Entire guesthouse,...,,2.0,[],$101.00,2,90,668,4.84,0.7,Q4
1,5769,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,,1.0,[],,1,14,294,4.91,0.544041,Q4
2,6413,previous scrape,Guesthouse in Austin · ★4.97 · Studio · 1 bed ...,,Travis Heights is one of the oldest neighborho...,f,78704,30.24885,-97.73587,Entire guesthouse,...,,1.0,[],,30,90,120,4.97,0.7,Q4


Let us start our analysis by looking at the missing values (NaNs).

In [55]:
df.isna().sum().sort_values(ascending=False)

bedrooms                 17307
neighborhood_overview    16070
description              12515
price                     2135
host_is_superhost         2058
beds                       363
bathrooms_text              34
id                           0
occupancy_rate               0
review_scores_rating         0
number_of_reviews            0
maximum_nights               0
minimum_nights               0
amenities                    0
accommodates                 0
source                       0
room_type                    0
property_type                0
longitude                    0
latitude                     0
zipcode                      0
name                         0
time_quarter                 0
dtype: int64
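
Raw counts are easier to judge next to the share of rows they represent. Below is a minimal sketch of that idea on a small made-up frame; the `toy` data and the `missing` table are hypothetical and only illustrate the pattern, they are not taken from the listings file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the listings frame
toy = pd.DataFrame({
    'price': ['$101.00', None, None, '$45.00'],
    'bedrooms': [np.nan, np.nan, np.nan, 1.0],
    'beds': [2.0, 1.0, 1.0, 1.0],
})

# Count and fraction of missing values per column, largest first
missing = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'frac_missing': toy.isna().mean(),
}).sort_values('n_missing', ascending=False)
print(missing)
```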

Since price is a very important feature in our data, let us dive deeper into why it has 2,135 missing values.

## Missing prices

In [56]:
df[df['price'].isna()].head(5)

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,latitude,longitude,property_type,...,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter
1,5769,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,,1.0,[],,1,14,294,4.91,0.544041,Q4
2,6413,previous scrape,Guesthouse in Austin · ★4.97 · Studio · 1 bed ...,,Travis Heights is one of the oldest neighborho...,f,78704,30.24885,-97.73587,Entire guesthouse,...,,1.0,[],,30,90,120,4.97,0.7,Q4
14,69810,previous scrape,Guesthouse in Austin · ★4.98 · 1 bedroom · 1 b...,,"Located in the cool Dawson area, about two mil...",f,78704,30.2309,-97.76619,Entire guesthouse,...,,1.0,[],,2,730,445,4.98,0.7,Q4
18,219168,previous scrape,Rental unit in Austin · ★4.88 · Studio · 1 bed...,,Despite its proximity to South Congress and th...,f,78704,30.24519,-97.74486,Entire rental unit,...,,1.0,[],,2,60,8,4.88,0.04308,Q4
19,72833,previous scrape,Guesthouse in Austin · ★4.91 · 1 bedroom · 1 b...,,Peaceful neighborhood street in central Austin...,t,78731,30.313,-97.75066,Entire guesthouse,...,,1.0,[],,3,60,413,4.91,0.7,Q4


Let us see if we can find out more about which listings have missing prices.

In [57]:
df[df['price'].isna()].describe(include='object')

Unnamed: 0,source,name,description,neighborhood_overview,host_is_superhost,property_type,room_type,bathrooms_text,amenities,price,time_quarter
count,2135,2135,0.0,1333,2134,2135,2135,2132,2135,0.0,2135
unique,2,1151,0.0,1276,2,40,4,18,1,0.0,1
top,previous scrape,Rental unit in Austin · 1 bedroom · 1 bed · 1 ...,,"Great location, close to everything Austin has...",f,Entire home,Entire home/apt,1 bath,[],,Q4
freq,2115,110,,11,1828,631,1555,970,2135,,2135


In [58]:
df[df['price'].isna()].describe()


Unnamed: 0,id,zipcode,latitude,longitude,accommodates,bedrooms,beds,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate
count,2135.0,2135.0,2135.0,2135.0,2135.0,0.0,2122.0,2135.0,2135.0,2135.0,2135.0,2135.0
mean,1.153142e+17,78723.140984,30.275237,-97.743172,3.998126,,2.052309,5.290867,542.600937,23.096956,4.794993,0.189055
std,2.717539e+17,20.665209,0.058512,0.050071,2.380016,,1.447688,29.952152,527.063859,50.841213,0.444607,0.230274
min,5769.0,78701.0,30.13013,-98.00946,1.0,,1.0,1.0,1.0,1.0,0.0,0.00185
25%,13169790.0,78704.0,30.239023,-97.764025,2.0,,1.0,1.0,15.0,2.0,4.775,0.024379
50%,23680260.0,78722.0,30.26535,-97.73799,4.0,,2.0,2.0,365.0,6.0,4.97,0.076225
75%,47038120.0,78744.0,30.29657,-97.716055,5.5,,3.0,3.0,1125.0,21.5,5.0,0.266056
max,1.029898e+18,78759.0,30.50632,-97.575828,16.0,,14.0,1100.0,1125.0,687.0,5.0,0.7


In [59]:
df[df['price'].isna()]['id'].nunique()

2135

As we can see, all of the missing prices come from the Q4 scrape, so something may have gone wrong at that time (each of these rows also has a unique id). None of the listings with a missing price have a description either, but this may not be a significant finding: the description column has many missing values beyond the rows with no price (the same holds for bedrooms).
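
One way to confirm where the missing prices are concentrated is to count them per quarter and per source. The sketch below does this on a hypothetical `toy` frame (the rows are invented for illustration); the same two `groupby` calls could be run on `df`.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the notebook's frame
toy = pd.DataFrame({
    'id': [1, 1, 2, 2, 3],
    'price': [None, '$45.00', None, '$80.00', '$60.00'],
    'source': ['previous scrape', 'previous scrape', 'previous scrape',
               'city scrape', 'city scrape'],
    'time_quarter': ['Q4', 'Q3', 'Q4', 'Q3', 'Q4'],
})

# Flag rows with a missing price, then count them per quarter and per source
toy['price_missing'] = toy['price'].isna()
by_quarter = toy.groupby('time_quarter')['price_missing'].sum()
by_source = toy.groupby('source')['price_missing'].sum()
print(by_quarter)
print(by_source)
```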

Let us see how many unique ids our listings have.

In [60]:
df['id'].nunique()

15175

With 47,037 rows but only 15,175 unique ids, many listings appear in more than one quarter. We might be able to recover the price of a property by looking at the prices it was listed for at other times.

In [61]:
missing_price_ids = list(df[df['price'].isna()]['id'].unique())

In [62]:
df[df['id'] == missing_price_ids[0]]

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,latitude,longitude,property_type,...,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter
1,5769,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,,1.0,[],,1,14,294,4.91,0.544041,Q4
12265,5769,previous scrape,Home in Austin · ★4.90 · 1 bedroom · 1 bed · 1...,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,,1.0,"[""Wifi \u2013 47 Mbps"", ""Lock on bedroom door""...",$45.00,1,14,290,4.9,0.549346,Q3
24012,5769,previous scrape,Home in Austin · ★4.90 · 1 bedroom · 1 bed · 1...,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,,78729,30.45697,-97.78422,Private room in home,...,,1.0,"[""Host greets you"", ""Essentials"", ""Shampoo"", ""...",$42.00,1,14,285,4.9,0.546291,Q2
35716,5769,previous scrape,NW Austin Room,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,1.0,1.0,"[""Private backyard"", ""Free parking on premises...",$42.00,1,14,275,4.9,0.539957,Q1


Let us replace the missing price values with the average price of the same listing in other time periods.

In [63]:
# convert the currency column to float where available (raw string avoids an invalid-escape warning)
df['price'] = df['price'].str.replace(r'[\$,]', '', regex=True).astype(float)
df['price'].dtype

dtype('float64')

In [64]:
# make a dictionary where the key is the index id for the missing prices and value is the average price for that listing
# based on other times of the year
avrg_price = {}
for i in missing_price_ids:
    avrg_price[i] = df[df['id'] == i]['price'].mean()

# replace the missings with average from other dates
df.loc[df['price'].isna(),'price'] = df[df['price'].isna()].apply(lambda x: avrg_price[x['id']], axis=1)

In [65]:
df['price'].isna().sum()

33

As we can see, most of the missing prices have been filled with the average price of the same listing in other quarters (the data are not missing at random). The remaining 33 rows belong to listings with no price in any quarter, so we can go ahead and drop them.
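
The dictionary-and-loop fill above can also be written with a grouped transform. This is a sketch of an alternative, not the code that produced the output above; the `toy` frame and the id `9999` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: one listing seen three times, one never priced
toy = pd.DataFrame({
    'id': [5769, 5769, 5769, 9999],
    'price': [np.nan, 45.0, 42.0, np.nan],
})

# Mean price per listing, broadcast back to every row of that listing
listing_mean = toy.groupby('id')['price'].transform('mean')

# Fill only the missing prices; listings with no price at all stay NaN
toy['price'] = toy['price'].fillna(listing_mean)
print(toy)
```

Because `transform` returns a Series aligned with the original index, no per-id loop or row-wise `apply` is needed.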

In [66]:
df = df.dropna(subset=['price'])
df.isna().sum().sort_values(ascending=False)

bedrooms                 17274
neighborhood_overview    16056
description              12482
host_is_superhost         2058
beds                       363
bathrooms_text              34
id                           0
occupancy_rate               0
review_scores_rating         0
number_of_reviews            0
maximum_nights               0
minimum_nights               0
price                        0
amenities                    0
accommodates                 0
source                       0
room_type                    0
property_type                0
longitude                    0
latitude                     0
zipcode                      0
name                         0
time_quarter                 0
dtype: int64

## Missing values in bedrooms

Every Airbnb listing should report its number of bedrooms, yet the bedrooms column is missing for 17,274 rows. The listing name column, however, often states the number of bedrooms, beds, and bathrooms.

In [67]:
df.iloc[0]['name']

'Guesthouse in Austin · ★4.84 · 1 bedroom · 2 beds · 1 bath'

From this pattern, we can see that the '·' character can be used to split the string. Let us first make sure that all the names in our listings follow the same convention.

In [68]:
df['name'].apply(lambda x: len(x.split('·'))).value_counts()

5    29971
1    11317
4     5620
3       90
2        6
Name: name, dtype: int64

As we can see, a lot of listings don't follow this naming convention. Let's take a look at them.

In [69]:
df[df['name'].apply(lambda x: len(x.split('·'))) == 1].head(2)

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,latitude,longitude,property_type,...,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter
35715,5456,city scrape,"Walk to 6th, Rainey St and Convention Ctr",Great central location for walking to Convent...,My neighborhood is ideally located if you want...,t,78702,30.26057,-97.73441,Entire guesthouse,...,1.0,2.0,"[""Heating"", ""Backyard"", ""Bed linens"", ""Hot wat...",176.0,2,90,630,4.84,0.7,Q1
35716,5769,previous scrape,NW Austin Room,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,1.0,1.0,"[""Private backyard"", ""Free parking on premises...",42.0,1,14,275,4.9,0.539957,Q1


Let us look at the first listing from the list above.

In [70]:
df[df['id'] ==5456]

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,latitude,longitude,property_type,...,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter
0,5456,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,,My neighborhood is ideally located if you want...,t,78702,30.26057,-97.73441,Entire guesthouse,...,,2.0,[],101.0,2,90,668,4.84,0.7,Q4
12264,5456,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,Week of July 31 - Aug 10 require min of 5 nig...,My neighborhood is ideally located if you want...,t,78702,30.26057,-97.73441,Entire guesthouse,...,1.0,2.0,"[""Air conditioning"", ""Hot water"", ""Dishes and ...",126.0,2,90,657,4.84,0.7,Q3
24007,5456,city scrape,Guesthouse in Austin · ★4.84 · 1 bedroom · 2 b...,Great central location for walking to Convent...,My neighborhood is ideally located if you want...,t,78702,30.26057,-97.73441,Entire guesthouse,...,1.0,2.0,"[""HDTV with Amazon Prime Video, HBO Max, Hulu,...",107.0,2,90,648,4.84,0.7,Q2
35715,5456,city scrape,"Walk to 6th, Rainey St and Convention Ctr",Great central location for walking to Convent...,My neighborhood is ideally located if you want...,t,78702,30.26057,-97.73441,Entire guesthouse,...,1.0,2.0,"[""Heating"", ""Backyard"", ""Bed linens"", ""Hot wat...",176.0,2,90,630,4.84,0.7,Q1


As we can see, this property appears in other quarters as well, and it looks like after Q1 Airbnb adopted the convention of using '·' in listing names. Let us check another listing.

In [71]:
df[df['id'] ==5769]

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,latitude,longitude,property_type,...,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter
1,5769,previous scrape,Home in Austin · ★4.91 · 1 bedroom · 1 bed · 1...,,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,,1.0,[],43.0,1,14,294,4.91,0.544041,Q4
12265,5769,previous scrape,Home in Austin · ★4.90 · 1 bedroom · 1 bed · 1...,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,,1.0,"[""Wifi \u2013 47 Mbps"", ""Lock on bedroom door""...",45.0,1,14,290,4.9,0.549346,Q3
24012,5769,previous scrape,Home in Austin · ★4.90 · 1 bedroom · 1 bed · 1...,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,,78729,30.45697,-97.78422,Private room in home,...,,1.0,"[""Host greets you"", ""Essentials"", ""Shampoo"", ""...",42.0,1,14,285,4.9,0.546291,Q2
35716,5769,previous scrape,NW Austin Room,<b>The space</b><br />Looking for a comfortabl...,Quiet neighborhood with lots of trees and good...,t,78729,30.45697,-97.78422,Private room in home,...,1.0,1.0,"[""Private backyard"", ""Free parking on premises...",42.0,1,14,275,4.9,0.539957,Q1


Let us see how many missing bedroom values we have for listings whose name follows the new '·' convention and states the number of beds and baths.

In [72]:
print('number of listings with missing bedrooms whose name has 5 "·"-separated sections')
print(df[df['name'].apply(lambda x: len(x.split('·'))) == 5]['bedrooms'].isna().sum())

number of listings with missing bedrooms whose name has 5 "·"-separated sections
13730


In [73]:
print('number of listings with missing bedrooms whose name has 4 "·"-separated sections')
print(df[df['name'].apply(lambda x: len(x.split('·'))) == 4]['bedrooms'].isna().sum())

number of listings with missing bedrooms whose name has 4 "·"-separated sections
2861


In [74]:
print('number of listings with missing bedrooms whose name has 3 "·"-separated sections')
print(df[df['name'].apply(lambda x: len(x.split('·'))) == 3]['bedrooms'].isna().sum())

number of listings with missing bedrooms whose name has 3 "·"-separated sections
65


In [75]:
print('number of listings with missing bedrooms whose name has 2 "·"-separated sections')
print(df[df['name'].apply(lambda x: len(x.split('·'))) == 2]['bedrooms'].isna().sum())

number of listings with missing bedrooms whose name has 2 "·"-separated sections
1


In [76]:
print('number of listings with missing bedrooms whose name has 1 section (no "·" separator)')
print(df[df['name'].apply(lambda x: len(x.split('·'))) == 1]['bedrooms'].isna().sum())

number of listings with missing bedrooms whose name has 1 section (no "·" separator)
617
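
The five cells above can be collapsed into a single grouped count. The sketch below shows one way to do it on a hypothetical `toy` frame; the first name is the one printed earlier, the others are invented to cover different section counts.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name': [
        'Guesthouse in Austin · ★4.84 · 1 bedroom · 2 beds · 1 bath',
        'Cottage in Austin · ★4.50 · 1 bedroom · 1 bed · 1 bath',  # hypothetical
        'Loft in Austin · 1 bedroom · 1 bed · 1 bath',             # hypothetical
        'NW Austin Room',
    ],
    'bedrooms': [np.nan, np.nan, 1.0, 1.0],
})

# Number of '·'-separated sections = number of separators + 1
toy['n_sections'] = toy['name'].str.count('·') + 1

# Missing bedroom values per section count
missing_by_sections = toy['bedrooms'].isna().groupby(toy['n_sections']).sum()
print(missing_by_sections)
```

`str.count` is vectorized, so it avoids the repeated `apply(lambda x: len(x.split('·')))` calls.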


It looks like most of the listings with a missing number of bedrooms have some indication of it in their name. We will use a regex to extract the bedroom information. The following regex extracts the first number before the word 'bedroom', 'Bedroom', 'bedrooms', or 'Bedrooms'.

In [77]:
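# first number that precedes 'bedroom(s)' / 'Bedroom(s)' in the listing name
bedroom_pattern = re.compile(r'\D*(\d+\.\d+|\d+)\D*(?:bedroom|bedrooms|Bedroom|Bedrooms)')

def extract_bedrooms(name):
    # search once and reuse the match; return NaN when nothing is found
    match = bedroom_pattern.search(name)
    return match.group(1) if match else np.nan

df['bedrooms_extracted'] = df['name'].apply(extract_bedrooms)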


In [78]:
# convert the string into int if they are not nan
df['bedrooms_extracted'] = pd.to_numeric(df['bedrooms_extracted'], errors='coerce').astype('Int64')


In [79]:
df['bedrooms_extracted'].value_counts()

1        14505
2         8923
3         6456
4         3030
5          949
6          335
7          146
8           60
9           30
10          23
13          10
12           8
23           4
14           4
15           3
1940         2
11           1
78704        1
33           1
Name: bedrooms_extracted, dtype: Int64
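
Values such as 1940 and 78704 suggest that the pattern sometimes captures numbers that are not bedroom counts. We have not inspected which names produce them, so the sketch below uses two hypothetical names to show how this can happen, and compares the pattern with a hypothetical stricter variant (`strict`) that requires the number to sit directly before the word. Both the helper `first_group` and the stricter pattern are assumptions for illustration.

```python
import re

# Pattern used above: digits, then any run of non-digits, then 'bedroom'
loose = re.compile(r'\D*(\d+\.\d+|\d+)\D*(?:bedroom|bedrooms|Bedroom|Bedrooms)')
# Hypothetical stricter variant: only whitespace may separate number and word
strict = re.compile(r'(\d+)\s*bedrooms?\b', flags=re.IGNORECASE)

def first_group(pattern, text):
    """Return the first captured group, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None

names = [
    'Guesthouse in Austin · ★4.84 · 1 bedroom · 2 beds · 1 bath',  # from the data
    '1940s bungalow with a cozy bedroom',                          # hypothetical
    'SoCo 78704 private bedroom',                                  # hypothetical
]
for name in names:
    print(first_group(loose, name), first_group(strict, name), '|', name)
```

The stricter variant would miss names that put words between the number and 'bedroom', so it trades recall for fewer false matches.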

In [80]:
df['bedrooms_extracted'].isna().sum()

12513

Now, let us take a look at the columns bedrooms and bedrooms_extracted together and see how many listings don't have a value in either of them.

In [81]:
df[df['bedrooms'].isna() & df['bedrooms_extracted'].isna()].shape

(2434, 24)

In [82]:
df[df['bedrooms'].isna() & df['bedrooms_extracted'].isna()].head(5)

Unnamed: 0,id,source,name,description,neighborhood_overview,host_is_superhost,zipcode,latitude,longitude,property_type,...,beds,amenities,price,minimum_nights,maximum_nights,number_of_reviews,review_scores_rating,occupancy_rate,time_quarter,bedrooms_extracted
2,6413,previous scrape,Guesthouse in Austin · ★4.97 · Studio · 1 bed ...,,Travis Heights is one of the oldest neighborho...,f,78704,30.24885,-97.73587,Entire guesthouse,...,1.0,[],91.666667,30,90,120,4.97,0.7,Q4,
18,219168,previous scrape,Rental unit in Austin · ★4.88 · Studio · 1 bed...,,Despite its proximity to South Congress and th...,f,78704,30.24519,-97.74486,Entire rental unit,...,1.0,[],250.0,2,60,8,4.88,0.04308,Q4,
22,224603,city scrape,Guesthouse in Austin · ★4.94 · Studio · 1 bed ...,,,t,78751,30.30732,-97.72782,Entire guesthouse,...,1.0,[],110.0,60,365,116,4.94,0.7,Q4,
43,78884,city scrape,Rental unit in Austin · ★4.78 · Studio · 2 bed...,,The Barton Hills neighborhood is located in Au...,t,78704,30.26126,-97.77358,Entire rental unit,...,2.0,[],101.0,2,90,199,4.78,0.473922,Q4,
107,343889,city scrape,Rental unit in Austin · ★4.75 · Studio · 1 bed...,,"Friendly, walkable residential neighborhood wi...",f,78702,30.27293,-97.72288,Entire rental unit,...,1.0,[],60.0,30,1125,36,4.75,0.505027,Q4,


It seems like a lot of these places could be studios. To dig further into it, let us not consider shared or private rooms and just focus on 'Entire home/apt' in the room_type section. 

In [85]:
df[df['bedrooms'].isna() & df['bedrooms_extracted'].isna() & df['room_type'].str.contains('Entire home/apt')].shape

(2352, 24)

Let us take a look at the name of the listing in this subset and see how many of them have the word 'studio' in the naming. 

In [88]:
df[df['bedrooms'].isna() & df['bedrooms_extracted'].isna() & 
   df['room_type'].str.contains('Entire home/apt') &
   df['name'].str.contains('Studio')
   ].shape

(1838, 24)
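
If these listings are indeed studios, one possible way to combine the findings so far is sketched below. It rests on an assumption that the notebook has not verified: an entire home/apt whose name contains 'Studio' is treated as having 0 bedrooms. The `toy` frame, its names, and the `bedrooms_filled` column are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name': [
        'Guesthouse in Austin · ★4.97 · Studio · 1 bed · 1 bath',  # hypothetical
        'Cottage in Austin · ★4.50 · 1 bedroom · 1 bed · 1 bath',  # hypothetical
        'NW Austin Room',
    ],
    'room_type': ['Entire home/apt', 'Private room', 'Private room'],
    'bedrooms': [np.nan, np.nan, 1.0],
    'bedrooms_extracted': pd.array([pd.NA, 1, pd.NA], dtype='Int64'),
})

# Prefer the original column, fall back to the value parsed from the name
toy['bedrooms_filled'] = toy['bedrooms'].fillna(
    toy['bedrooms_extracted'].astype('float64')
)

# Assumption: an entire home/apt named 'Studio' with no count has 0 bedrooms
is_studio = (
    toy['bedrooms_filled'].isna()
    & (toy['room_type'] == 'Entire home/apt')
    & toy['name'].str.contains('Studio')
)
toy.loc[is_studio, 'bedrooms_filled'] = 0
print(toy[['name', 'bedrooms_filled']])
```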

Similar to what we did before, let us make a dictionary of listings that have 4 or 5 sections in their name and see if we can use it to replace the names that have fewer sections with ones that follow the new convention.

In [None]:
# collect the ids of listings whose name has 4 sections; the dictionary below
# is meant to map each id to its name in the new convention
name_4_sections_id = df[df['name'].apply(lambda x: len(x.split('·'))) == 4]['id'].unique()
name_4_sections = {}


In [None]:
df['name'].apply(lambda x: len(x.split('·')))

In [None]:
df[df['name'].apply(lambda x: len(x.split('·'))) == 3].head(3)

In [None]:
df[df['id'] ==978089]

In [None]:
re.findall(r'\d+', df.iloc[0]['name'].split('·')[2])

In [None]:
re.findall(r'\b\d+\.\d+\b', df.iloc[0]['name'].split('·')[1])

In [None]:
re.findall(r'(\d+\.\d+|\d+)\s*(?:year|Years)', '2.5 2.5 Years')[0]

In [None]:
re.findall(r'\D*(\d+\.\d+|\d+)\D*(?:year|Year)', '2.55 aa-- Years')

In [None]:
re.match(r'\D*(\d+\.\d+|\d+)\D*(?:year|Year)', '2.55 aa-- Years')