# Data Cleaning

In [1]:
# As always
import pandas as pd
import numpy as np

### Building a dataframe by hand

In [2]:
some_data = [1, 2, 3, 4, 5] # list of numbers
some_more_data = ['a', 'b', 'c', 'd', 'e'] # list of letters
some_booleans = [True, False, True, True, True] # list of booleans
df = pd.DataFrame({'numbers':some_data, 'letters':some_more_data, 'bools':some_booleans})
df

Unnamed: 0,numbers,letters,bools
0,1,a,True
1,2,b,False
2,3,c,True
3,4,d,True
4,5,e,True


We can check out the datatypes of each column using `df.info()`

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   numbers  5 non-null      int64 
 1   letters  5 non-null      object
 2   bools    5 non-null      bool  
dtypes: bool(1), int64(1), object(1)
memory usage: 213.0+ bytes


### Changing data types

In [6]:
classes  = ['CS1111', 'PSYC1010', 'CS2150', 'ECON2010', 'SOC2010']
courseforum_ratings = ['4', 3.8, '0.2', 2, '4'] # Note: a mix of strings, floats, and ints
lou = pd.DataFrame({'courses':classes, 'ratings':courseforum_ratings})
lou

Unnamed: 0,courses,ratings
0,CS1111,4.0
1,PSYC1010,3.8
2,CS2150,0.2
3,ECON2010,2.0
4,SOC2010,4.0


Say we want to double the rating values

In [7]:
lou['ratings_2x'] = lou['ratings']*2
lou

Unnamed: 0,courses,ratings,ratings_2x
0,CS1111,4.0,44
1,PSYC1010,3.8,7.6
2,CS2150,0.2,0.20.2
3,ECON2010,2.0,4
4,SOC2010,4.0,44


That's not what we expected. Let's drop that column.

In [8]:
lou = lou.drop('ratings_2x', axis=1)
lou

Unnamed: 0,courses,ratings
0,CS1111,4.0
1,PSYC1010,3.8
2,CS2150,0.2
3,ECON2010,2.0
4,SOC2010,4.0


Two important things to notice here:
1. We've specified the axis on which we'd like to drop `ratings_2x`. Since it's a column, we specify `axis=1`
    - In general for axes: Rows = 0, Columns = 1
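2. We assigned the dataframe that `.drop()` returns back to `lou`. `.drop()` returns a new dataframe and leaves the original unchanged, so without the assignment the column would still be there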

Let's try scaling ratings again

Before, our `ratings` column had mixed datatypes: strings, floats, and ints.
<br>In Python, multiplying a string repeats it, so we got `"44"` instead of `8` for the first entry

In [11]:
# For example:
x = 'hello'
print(x * 2)

y = 2.71
print(y * 2)

hellohello
5.42


To change the datatype, we can convert each element in the `ratings` series

In [16]:
lou['ratings'] = lou['ratings'].apply(lambda x: float(x))

lou['ratings_2x'] = lou.ratings * 2
lou

Unnamed: 0,courses,ratings,ratings_2x
0,CS1111,4.0,8.0
1,PSYC1010,3.8,7.6
2,CS2150,0.2,0.4
3,ECON2010,2.0,4.0
4,SOC2010,4.0,8.0


Pandas offers an easier way to convert data types: the `.astype()` method

<br> Check out what happens when we pass `'str'`, `'float'`, and `'int'` to `.astype()`

In [28]:
lou['ratings'] = lou['ratings'].astype('str')
lou['ratings_2x_str'] = lou.ratings * 2
lou

Unnamed: 0,courses,ratings,ratings_2x,ratings_2x_str
0,CS1111,4.0,8.0,4.04.0
1,PSYC1010,3.8,7.6,3.83.8
2,CS2150,0.2,0.4,0.20.2
3,ECON2010,2.0,4.0,2.02.0
4,SOC2010,4.0,8.0,4.04.0


In [29]:
lou['ratings'] = lou['ratings'].astype('float')
lou

Unnamed: 0,courses,ratings,ratings_2x,ratings_2x_str
0,CS1111,4.0,8.0,4.04.0
1,PSYC1010,3.8,7.6,3.83.8
2,CS2150,0.2,0.4,0.20.2
3,ECON2010,2.0,4.0,2.02.0
4,SOC2010,4.0,8.0,4.04.0


In [27]:
lou['ratings'] = lou['ratings'].astype('int')
lou

Unnamed: 0,courses,ratings,ratings_2x,ratings_2x_str
0,CS1111,4,8.0,4.04.0
1,PSYC1010,3,7.6,3.83.8
2,CS2150,0,0.4,0.20.2
3,ECON2010,2,4.0,2.02.0
4,SOC2010,4,8.0,4.04.0
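
Both `float()` and `.astype('float')` raise an error if an entry can't be parsed as a number at all. A minimal sketch of one alternative, `pd.to_numeric` with `errors='coerce'`, is below. The `raw_ratings` values are made up for illustration and are not part of the dataset above.

```python
import pandas as pd

# Hypothetical ratings column: strings, floats, ints, and one unparseable entry
raw_ratings = pd.Series(['4', 3.8, '0.2', 2, 'n/a'])

# errors='coerce' turns anything that can't be parsed into NaN instead of raising
clean_ratings = pd.to_numeric(raw_ratings, errors='coerce')
print(clean_ratings)
print(clean_ratings * 2)
```

The coerced entries then show up as missing values, which can be handled like any other `NaN`.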


## Handling missing values

Missing data is a common issue, and an incredible headache in machine learning.

<br> Take, for example, this dataframe with several `NaN` and `None` values, both of which can be used to represent missing data

In [38]:
students = ['Student_A', 'Student_B', 'Student_C', 'Student_D', np.nan, 'Student_F', 'Student_G']
years = [1, np.nan, 3, None, 4, np.nan, 1]
majors = ['Econ', 'PolySci', 'CS', 'Phil', np.nan, 'Chemistry', np.nan]
df = pd.DataFrame({'student':students,'year' :years, 'major':majors})
df

Unnamed: 0,student,year,major
0,Student_A,1.0,Econ
1,Student_B,,PolySci
2,Student_C,3.0,CS
3,Student_D,,Phil
4,,4.0,
5,Student_F,,Chemistry
6,Student_G,1.0,


Note how pandas automatically converted `None` to `NaN` in the numeric `year` column. `np.nan` is what we'll usually work with

### Identifying missing values

We can use `.isna()` on a dataframe or series

In [40]:
df.isna()

Unnamed: 0,student,year,major
0,False,False,False
1,False,True,False
2,False,False,False
3,False,True,False
4,True,False,True
5,False,True,False
6,False,False,True


If we wanted to count the number of nulls for each feature (aka column) in our dataset,
<br> we'd have to take the sum **across** rows 0 through 6. Which axis number should we specify to sum across, then?

<br> (Hint): Peep the general rule of thumb from above

In [43]:
df.isna().sum(axis=0)

student    1
year       3
major      2
dtype: int64

If we wanted instead to see how many null values we have for every student, we'd swap the axis:

In [44]:
df.isna().sum(axis=1)

0    0
1    1
2    0
3    1
4    2
5    1
6    1
dtype: int64

Remember, we want a breakdown *by row*, so we're counting the nulls *across* the columns of each row
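
The same boolean table can also be used to pull out the rows that have something missing. This is a small sketch on a made-up dataframe, `students_df`, shaped like the one above:

```python
import numpy as np
import pandas as pd

# Hypothetical, smaller version of the student dataframe
students_df = pd.DataFrame({
    'student': ['Student_A', 'Student_B', np.nan],
    'year': [1, np.nan, 4],
    'major': ['Econ', 'PolySci', np.nan],
})

# True for each row that has at least one missing value
has_missing = students_df.isna().any(axis=1)
print(students_df[has_missing])
```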

### Strategy 1: Dropping rows with missing values

By far the easiest way of dealing with missing data is just dropping rows that have missing values.

By default, `dropna()` will drop rows that have a null value for *any* column

In [45]:
df_dropped = df.dropna()
df_dropped

Unnamed: 0,student,year,major
0,Student_A,1.0,Econ
2,Student_C,3.0,CS


You might instead want to drop a row only when it has a missing value in one particular column.
<br>In this case, we can pass the `subset=` argument to `.dropna()`

In [46]:
df.dropna(subset=['student'])

Unnamed: 0,student,year,major
0,Student_A,1.0,Econ
1,Student_B,,PolySci
2,Student_C,3.0,CS
3,Student_D,,Phil
5,Student_F,,Chemistry
6,Student_G,1.0,
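
`.dropna()` has a couple of other knobs besides `subset=`. The sketch below shows `how='all'` and `thresh=` on a made-up dataframe, `demo_df`, chosen so each option keeps a different set of rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 is complete, row 1 has one value, row 2 is empty
demo_df = pd.DataFrame({
    'student': ['Student_A', np.nan, np.nan],
    'year': [1, 4, np.nan],
    'major': ['Econ', np.nan, np.nan],
})

# how='all' drops a row only when every column is missing
print(demo_df.dropna(how='all'))

# thresh=2 keeps rows that have at least 2 non-missing values
print(demo_df.dropna(thresh=2))
```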


### Strategy 2: Filling in missing values

We can fill in missing values in a variety of different ways. 
<br> We can use a specific value (like the mean), forward-fill, back-fill, or use a variety of more advanced imputation methods such as K-nearest neighbors (KNN)

Replacing missing values with a specific value:

In [47]:
df.fillna(0) # replace missing values with 0

Unnamed: 0,student,year,major
0,Student_A,1.0,Econ
1,Student_B,0.0,PolySci
2,Student_C,3.0,CS
3,Student_D,0.0,Phil
4,0,4.0,0
5,Student_F,0.0,Chemistry
6,Student_G,1.0,0


You can also use this to replace missing values with the mean of the values:

In [48]:
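df_copy = df # Careful: this is not a real copy. df_copy and df are two names for the same dataframe, so df changes too (use df.copy() for an independent copy)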
df_copy['year'] = df['year'].fillna(df['year'].mean())
df_copy

Unnamed: 0,student,year,major
0,Student_A,1.0,Econ
1,Student_B,2.25,PolySci
2,Student_C,3.0,CS
3,Student_D,2.25,Phil
4,,4.0,
5,Student_F,2.25,Chemistry
6,Student_G,1.0,


In [51]:
df['student'] = df['student'].fillna('Unknown Student')
df

Unnamed: 0,student,year,major
0,Student_A,1.0,Econ
1,Student_B,2.25,PolySci
2,Student_C,3.0,CS
3,Student_D,2.25,Phil
4,Unknown Student,4.0,
5,Student_F,2.25,Chemistry
6,Student_G,1.0,
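
The forward-fill and back-fill options mentioned at the start of this strategy can be sketched like this. The `years` series is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps
years = pd.Series([1, np.nan, 3, np.nan, 4])

# Forward-fill: carry the last seen value forward into each gap
print(years.ffill())

# Back-fill: pull the next seen value backward into each gap
print(years.bfill())
```

These only make sense when the row order is meaningful (for example, readings over time), since each gap is filled from its neighbor.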


## String processing

Often, a dataset will contain string representations of data that could be really useful if you could find some way to extract it. 

<br> Let's start off with a dataframe

In [116]:
people = ['Christian Jung', 'Ishaan Dey', 'Carter Bristow', 'Chris Santamaria', 
          'Shawn Weigand', 'Jasmine Dogu', 'Nithin Vijayakumar', 'John Doe', 'Ben H.'] 
classes = ['Node Pro', 'node', 'Node lite', 'deploy', 'Source', 'Node', 'node lite', 'Source Lite', 'node lite']

courses_df = pd.DataFrame({'person':people, 'section':classes})
courses_df

Unnamed: 0,person,section
0,Christian Jung,Node Pro
1,Ishaan Dey,node
2,Carter Bristow,Node lite
3,Chris Santamaria,deploy
4,Shawn Weigand,Source
5,Jasmine Dogu,Node
6,Nithin Vijayakumar,node lite
7,John Doe,Source Lite
8,Ben H.,node lite


It'd be great if we could work with just the first names of everyone. 
With normal python strings, this is pretty easy to do using the `.split()` function:

In [117]:
name = 'Christian Jung'
print(name)
print(name.split())
print(name.split()[0])

Christian Jung
['Christian', 'Jung']
Christian


Let's try using that to extract the first names from the column `person`

In [118]:
courses_df.person.split()

AttributeError: 'Series' object has no attribute 'split'

Looks like we got an error: We can't use split() on the series object directly.
<br><br> Instead, we have to "vectorize it" using `.str`  first

In [119]:
courses_df.person.str.split()

0        [Christian, Jung]
1            [Ishaan, Dey]
2        [Carter, Bristow]
3      [Chris, Santamaria]
4         [Shawn, Weigand]
5          [Jasmine, Dogu]
6    [Nithin, Vijayakumar]
7              [John, Doe]
8                [Ben, H.]
Name: person, dtype: object

Before we move on, check out the object type of the output using `type()`

### Lambda apply functions

Lambda apply functions are a pretty helpful tool for cleaning. Here's one quick example:

In [120]:
courses_df['first_name'] = courses_df.person.str.split().apply(lambda x: x[0])
courses_df

Unnamed: 0,person,section,first_name
0,Christian Jung,Node Pro,Christian
1,Ishaan Dey,node,Ishaan
2,Carter Bristow,Node lite,Carter
3,Chris Santamaria,deploy,Chris
4,Shawn Weigand,Source,Shawn
5,Jasmine Dogu,Node,Jasmine
6,Nithin Vijayakumar,node lite,Nithin
7,John Doe,Source Lite,John
8,Ben H.,node lite,Ben


Quite literally, this reads: <br>

For every element `x` in the series `courses_df.person.str.split()`, take the first element of `x` and save it to a new column `first_name`
<br> In other words, we are **apply**ing the *anonymous* function `lambda x: x[0]` to every value in the series
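
For this particular job there is also a shortcut that skips the lambda: the `.str` accessor supports indexing. A small sketch on a made-up `names` series:

```python
import pandas as pd

# Hypothetical series of full names
names = pd.Series(['Christian Jung', 'Ishaan Dey', 'Ben H.'])

# .str[0] indexes into each list produced by .str.split()
first_names = names.str.split().str[0]
print(first_names)
```

`.apply()` with a lambda is still the more general tool, since it works for any function and not just string operations.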

### Changing capitalization to better process text

Let's look at how many of the 9 people here are teaching each section!

In [121]:
courses_df.section.value_counts()

node lite      2
deploy         1
Node lite      1
Node           1
node           1
Source Lite    1
Source         1
Node Pro       1
Name: section, dtype: int64

We should be counting `node lite` and `Node lite` together, but because their cases don't match, pandas treats them as separate values.

<br>An easy way to solve this is by converting all the text to a uniform case

In [122]:
"Jasmine Dogu".upper()

'JASMINE DOGU'

In [123]:
courses_df['section'] = courses_df['section'].str.upper()
courses_df

Unnamed: 0,person,section,first_name
0,Christian Jung,NODE PRO,Christian
1,Ishaan Dey,NODE,Ishaan
2,Carter Bristow,NODE LITE,Carter
3,Chris Santamaria,DEPLOY,Chris
4,Shawn Weigand,SOURCE,Shawn
5,Jasmine Dogu,NODE,Jasmine
6,Nithin Vijayakumar,NODE LITE,Nithin
7,John Doe,SOURCE LITE,John
8,Ben H.,NODE LITE,Ben


In [124]:
courses_df.section.value_counts()

NODE LITE      3
NODE           2
SOURCE LITE    1
DEPLOY         1
SOURCE         1
NODE PRO       1
Name: section, dtype: int64

**Try it out**: How many folks are staffing the overall node course offering (i.e. Node, Node Pro, Node Lite)?

<br>Hint: Google the documentation for `pd.Series.str`

In [134]:
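# Option 1: count sections by their first word
courses_df.section.apply(lambda x: x.split()[0]).value_counts()

# Option 2: count the rows whose section contains 'NODE'
courses_df.section.str.contains('NODE').sum()

6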

## Date & Time processing

In [135]:
presidents = ['Washington' ,'Lincoln', 'Kennedy', 'Obama', 'Trump']
birthdays = ['Feb 27 1732', '2-12-1809', 'May 29th, 1917', '8 4 1961','06//14// //1946' ]

bdays = pd.DataFrame({'president': presidents, 'birthday': birthdays})
bdays

Unnamed: 0,president,birthday
0,Washington,Feb 27 1732
1,Lincoln,2-12-1809
2,Kennedy,"May 29th, 1917"
3,Obama,8 4 1961
4,Trump,06//14// //1946


Yikes! Let's see if we can clean up the time series data using `pd.to_datetime`

In [137]:
bdays['birthday'] = pd.to_datetime(bdays['birthday'])
bdays

Unnamed: 0,president,birthday
0,Washington,1732-02-27
1,Lincoln,1809-02-12
2,Kennedy,1917-05-29
3,Obama,1961-08-04
4,Trump,1946-06-14


As you can see, `pd.to_datetime` is pretty powerful. It can read quite a few time formats given as strings and convert them into a series of `Timestamp` values

In [141]:
type(bdays.birthday[0])

pandas._libs.tslibs.timestamps.Timestamp

Unfortunately, there are some formats `pd.to_datetime()` won't recognize on its own

In [151]:
presidents = ['Washington' ,'Lincoln', 'Kennedy']
birthdays = ['2###27adjf1732', '2###12adjf1809', '5###05adjf1917']

bdays_messy = pd.DataFrame({'president': presidents, 'birthday': birthdays})
bdays_messy

Unnamed: 0,president,birthday
0,Washington,2###27adjf1732
1,Lincoln,2###12adjf1809
2,Kennedy,5###05adjf1917


In [149]:
bdays_messy['birthday'] = pd.to_datetime(bdays_messy['birthday'])
bdays_messy

ParserError: Unknown string format: 2###27adjf1732

However, we CAN specify the format ourselves! 
<br><br>Look up the documentation (google!) to see if we can pass any parameters to help it along

In [150]:
bdays_messy['birthday'] = pd.to_datetime(bdays_messy['birthday'], format = '%m###%dadjf%Y')
bdays_messy

Unnamed: 0,president,birthday
0,Washington,1732-02-27
1,Lincoln,1809-02-12
2,Kennedy,1917-05-05
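
If some entries don't match the format at all, `pd.to_datetime` raises an error by default. A minimal sketch of the `errors='coerce'` option, which returns `NaT` (the datetime version of `NaN`) for those entries instead; the `raw` series and its bad entry are made up for illustration:

```python
import pandas as pd

# Hypothetical messy column with one entry that is not a date
raw = pd.Series(['2###27adjf1732', 'not a date', '5###05adjf1917'])

# Entries that don't match the format become NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%m###%dadjf%Y', errors='coerce')
print(parsed)
```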


### Using pandas datetime objects

We can pull quite a lot just from a datetime timestamp using attributes

In [156]:
washington = bdays.at[0, 'birthday'] # Taking the value for washington's bday
print(washington) # the raw timestamp

1732-02-27 00:00:00


In [164]:
washington.month # The month, encoded as an int

2

In [165]:
washington.month_name()

'February'

In [166]:
washington.year

1732

In [167]:
washington.is_leap_year

True

In [168]:
washington.daysinmonth

29

### Make new columns from these datetime attributes 

Let's use this to make new columns that reflect these attributes:

In [172]:
bdays['month'] = bdays.birthday.apply(lambda x: x.month_name())
bdays

Unnamed: 0,president,birthday,month
0,Washington,1732-02-27,February
1,Lincoln,1809-02-12,February
2,Kennedy,1917-05-29,May
3,Obama,1961-08-04,August
4,Trump,1946-06-14,June


Try it out with the others:

In [173]:
bdays['is_leap'] = bdays.birthday.apply(lambda x: x.is_leap_year) # True if the birth year is a leap year
bdays['day'] = bdays.birthday.apply(lambda x: x.dayofweek) # day of the week as an int: Monday=0, Sunday=6
bdays

Unnamed: 0,president,birthday,month,is_leap,day
0,Washington,1732-02-27,February,True,2
1,Lincoln,1809-02-12,February,False,6
2,Kennedy,1917-05-29,May,False,1
3,Obama,1961-08-04,August,False,4
4,Trump,1946-06-14,June,False,4
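
As with `.str` for strings, datetime columns have a `.dt` accessor that reaches these attributes for the whole column at once, without a lambda. A small sketch on a standalone `birthdays` series using three of the dates above:

```python
import pandas as pd

birthdays = pd.Series(pd.to_datetime(['1809-02-12', '1917-05-29', '1961-08-04']))

# The .dt accessor exposes datetime attributes for the whole column at once
print(birthdays.dt.month_name())
print(birthdays.dt.is_leap_year)
print(birthdays.dt.dayofweek)  # Monday=0 ... Sunday=6
```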


### Filtering with datetimes

Let's say we want to subset the dataframe for presidents born *before* WWI started (July 28th, 1914)

In [181]:
bdays[bdays.birthday < pd.to_datetime('July 28th, 1914')]

Unnamed: 0,president,birthday,month,is_leap,day
0,Washington,1732-02-27,February,True,2
1,Lincoln,1809-02-12,February,False,6


We can also do some quick maths quite easily:

<br> For example, how much older was Washington than Lincoln?

In [183]:
diff = bdays.birthday[1] - bdays.birthday[0]

print(diff)
print(type(diff))


28109 days 00:00:00
<class 'pandas._libs.tslibs.timedeltas.Timedelta'>


Note: this returns a `Timedelta` object, not a `Timestamp`. It has attributes of its own, such as `.days`

In [195]:
diff.days

28109

In [196]:
diff.days / 365.25

76.95824777549623
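
The same arithmetic works on a whole column. The sketch below subtracts a column of birthdays from a single reference date; `ww1_start` and the standalone `birthdays` series are names introduced here for illustration:

```python
import pandas as pd

birthdays = pd.Series(pd.to_datetime(['1732-02-27', '1809-02-12']))
ww1_start = pd.Timestamp('1914-07-28')

# Subtracting a column from a Timestamp gives a column of Timedeltas
gap = ww1_start - birthdays

# .dt.days pulls the whole-day count out of each Timedelta
print(gap.dt.days)
```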

### How can we apply this?

When you get data that includes time as a variable, it'll be in one of many possible formats, and not always consistent throughout the whole dataset. 


`pd.to_datetime` makes the process of cleaning these incredibly easy!

Once cleaned, we can look at specific attributes such as month, day, and year **to gain insight we wouldn't otherwise have been able to access.**

There's a lot, lot more you can do with pandas datetimes - use business days, adjust for time zones - just about anything you'd imagine.

The docs for all of that are linked here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#overview

## Merging DataFrames

Merging sources of data is super important:
    
<br> Sometimes you have data from two different sources that you'd like to have in one dataframe to analyze. We can do that with `pd.concat()` and `pd.merge()`

In [323]:
ratings_df = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'], 
                        'Rating': [4, 3, 5, 4, 4, 3]})
ratings_df = ratings_df.sort_values(by='Rating')
ratings_df

Unnamed: 0,Dining,Rating
1,Newcomb,3
5,Subway,3
0,Castle,4
3,Five Guys,4
4,Runk,4
2,Chick-fil-A,5


In [324]:
prices_df = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'], 
                        'Price': [8.5, 9.5, 6.5, 7.75, 9.5, 6]})
prices_df = prices_df.sort_values(by='Price')
prices_df

Unnamed: 0,Dining,Price
5,Subway,6.0
2,Chick-fil-A,6.5
3,Five Guys,7.75
0,Castle,8.5
1,Newcomb,9.5
4,Runk,9.5


In [325]:
locations_df = pd.DataFrame({'Dining': ['Castle', 'Newcomb', 'Chick-fil-A', 'Five Guys', 'Runk', 'Subway'],
                            'Location': ['Old dorms', 'Central Grounds', 'Central Grounds', 'Central Grounds', 'Gooch-Dillard', 'Central Grounds']})
locations_df = locations_df.sort_values(by='Location')
locations_df

Unnamed: 0,Dining,Location
1,Newcomb,Central Grounds
2,Chick-fil-A,Central Grounds
3,Five Guys,Central Grounds
5,Subway,Central Grounds
4,Runk,Gooch-Dillard
0,Castle,Old dorms


Note that each one of these dataframes has a column in common, `Dining`.

The order of the values may not be same, but we're still good to go

### pd.merge()

Instead of working with three distinct dataframes, let's combine them into one df

To do so, we can call `pd.merge()` on two dataframes and specify the column to merge on with `on=`

In [326]:
df = pd.merge(ratings_df, prices_df, on='Dining')
df

Unnamed: 0,Dining,Rating,Price
0,Newcomb,3,9.5
1,Subway,3,6.0
2,Castle,4,8.5
3,Five Guys,4,7.75
4,Runk,4,9.5
5,Chick-fil-A,5,6.5


By chaining merges, we can combine more than two dataframes

In [327]:
df = pd.merge(ratings_df, prices_df, on='Dining').merge(locations_df, on='Dining')
df

Unnamed: 0,Dining,Rating,Price,Location
0,Newcomb,3,9.5,Central Grounds
1,Subway,3,6.0,Central Grounds
2,Castle,4,8.5,Old dorms
3,Five Guys,4,7.75,Central Grounds
4,Runk,4,9.5,Gooch-Dillard
5,Chick-fil-A,5,6.5,Central Grounds


### Join logic

In [329]:
more_ratings_df = pd.DataFrame({'Dining': ["Newcomb", "Subway", "Starbucks", "Burrito Theory", "O'Hill"], 
                                'Rating': [3, 3, 4, 4, 3]})
more_ratings_df

Unnamed: 0,Dining,Rating
0,Newcomb,3
1,Subway,3
2,Starbucks,4
3,Burrito Theory,4
4,O'Hill,3


With the previous merges, we had the same number of observations in every dataframe.

<br>With some merges, not every row may align. Let's try to merge `more_ratings_df` with `prices_df`. Note how there are some dining halls in common, and some unique to each

In [338]:
print(more_ratings_df.Dining.unique())
print(prices_df.Dining.unique())

# Only Newcomb and Subway are common to both

['Newcomb' 'Subway' 'Starbucks' 'Burrito Theory' "O'Hill"]
['Subway' 'Chick-fil-A' 'Five Guys' 'Castle' 'Newcomb' 'Runk']


We can do two different merges now: 
1. If we want to retain **only** those in common, we use an `inner` join
2. If we want to keep **everything**, and keep placeholders for missing data, we use an `outer` join

In [341]:
pd.merge(prices_df, more_ratings_df, on='Dining', how='inner')

Unnamed: 0,Dining,Price,Rating
0,Subway,6.0,3
1,Newcomb,9.5,3


In [342]:
pd.merge(prices_df, more_ratings_df, on='Dining', how='outer')

Unnamed: 0,Dining,Price,Rating
0,Subway,6.0,3.0
1,Chick-fil-A,6.5,
2,Five Guys,7.75,
3,Castle,8.5,
4,Newcomb,9.5,3.0
5,Runk,9.5,
6,Starbucks,,4.0
7,Burrito Theory,,4.0
8,O'Hill,,3.0
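
`inner` and `outer` are not the only choices for `how=`. A minimal sketch of a `left` join, which keeps every row of the first dataframe, together with `indicator=True` to show where each row came from. The small `prices` and `ratings` dataframes are cut-down, made-up versions of the ones above:

```python
import pandas as pd

prices = pd.DataFrame({'Dining': ['Subway', 'Castle', 'Newcomb'],
                       'Price': [6.0, 8.5, 9.5]})
ratings = pd.DataFrame({'Dining': ['Newcomb', 'Subway', 'Starbucks'],
                        'Rating': [3, 3, 4]})

# how='left' keeps every row of the left dataframe (prices);
# indicator=True adds a _merge column saying where each row came from
left_joined = pd.merge(prices, ratings, on='Dining', how='left', indicator=True)
print(left_joined)
```

Rows marked `left_only` in the `_merge` column are the ones that had no match in `ratings`, so their `Rating` is `NaN`.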


### pd.concat()

Another *similar* function is `.concat()` 

It's a little different from `.merge()`, since we'll have to pass in a `list` of dataframes instead

In [344]:
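# df1, df2, df3 are short names for the three dataframes built above
df1, df2, df3 = ratings_df, prices_df, locations_df
df = pd.concat([df1, df2, df3])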
df

Unnamed: 0,Dining,Rating,Price,Location
1,Newcomb,3.0,,
5,Subway,3.0,,
0,Castle,4.0,,
3,Five Guys,4.0,,
4,Runk,4.0,,
2,Chick-fil-A,5.0,,
5,Subway,,6.0,
2,Chick-fil-A,,6.5,
3,Five Guys,,7.75,
0,Castle,,8.5,


That didn't quite work as expected, because `concat()` stacked the dataframes on top of each other instead of combining information for common rows.

Note that it **didn't combine rows** when `merge()` easily could have.

One case where `concat()` is appropriate is when we want to **add more rows** to a dataframe, and both dataframes have the same columns

In [345]:
more_ratings_df

Unnamed: 0,Dining,Rating
0,Newcomb,3
1,Subway,3
2,Starbucks,4
3,Burrito Theory,4
4,O'Hill,3


In [348]:
df1

Unnamed: 0,Dining,Rating
1,Newcomb,3
5,Subway,3
0,Castle,4
3,Five Guys,4
4,Runk,4
2,Chick-fil-A,5


In [347]:
# reset_index(drop=True) renumbers the rows so the indices don't overlap
# Note: concat doesn't remove duplicates, so Newcomb and Subway (present in both) appear twice
df1_all = pd.concat([df1, more_ratings_df]).reset_index(drop=True)
df1_all

Unnamed: 0,Dining,Rating
0,Newcomb,3
1,Subway,3
2,Castle,4
3,Five Guys,4
4,Runk,4
5,Chick-fil-A,5
6,Newcomb,3
7,Subway,3
8,Starbucks,4
9,Burrito Theory,4
10,O'Hill,3


`concat()` can also stack dataframes horizontally, using the `axis=1` argument.

Here's a case where it might be useful:

In [350]:
more_info_df = pd.DataFrame({'Popularity': [8, 5, 10, 7, 8, 7], 'Hours': ["7:00-9:00", "7:00-8:00", "11:00-8:00", "11:00-8:00", "7:00-8:00", "11:00-8:00"]})
more_info_df

Unnamed: 0,Popularity,Hours
0,8,7:00-9:00
1,5,7:00-8:00
2,10,11:00-8:00
3,7,11:00-8:00
4,8,7:00-8:00
5,7,11:00-8:00


In [352]:
df = pd.concat([df1, more_info_df], axis=1)
df

Unnamed: 0,Dining,Rating,Popularity,Hours
0,Castle,4,8,7:00-9:00
1,Newcomb,3,5,7:00-8:00
2,Chick-fil-A,5,10,11:00-8:00
3,Five Guys,4,7,11:00-8:00
4,Runk,4,8,7:00-8:00
5,Subway,3,7,11:00-8:00


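Note the difference between `pd.concat(axis=1)` and `pd.merge()`: `concat` lines rows up by their index labels, while `merge` lines them up by the values in a shared column. We would use `pd.concat()` when the dataframes have no column in common but their indexes match, and `pd.merge()` when they share a key column such as `Dining`.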
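
One detail worth seeing in code: `pd.concat(axis=1)` pairs rows by index label rather than by position. A small sketch with two made-up dataframes, `left` and `right`, whose indexes are in different orders:

```python
import pandas as pd

# Hypothetical dataframes: same index labels, listed in a different order
left = pd.DataFrame({'Dining': ['Newcomb', 'Castle']}, index=[1, 0])
right = pd.DataFrame({'Popularity': [8, 5]}, index=[0, 1])

# concat(axis=1) matches rows by index label, not by position:
# Castle (label 0) gets 8 and Newcomb (label 1) gets 5
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

If the index labels don't describe the same rows in both dataframes, resetting the index first or using `pd.merge()` on a real key column is the safer route.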