## What is data science?

In [1]:
from IPython.display import Image, HTML

The term 'data science' has become a buzzword in recent years. Data scientists and data analysts are highly sought after, and the field has become a popular branch of IT.

But what exactly is data science, and what skills are necessary to become someone who analyzes data?



Data science is an interdisciplinary field, which means it sits at the intersection of multiple disciplines. At the centre of it is data - the most valuable commodity of today.

Data science seeks to analyze and process data in such a way that new knowledge can be extracted from it. This knowledge serves to improve businesses and advance science by supporting the decision-making process (what should we do?), making predictions (what will happen next?) and discovering patterns (finding hidden information in the data).

In [2]:
Image(url= "../img/DS.png", width=400, height=1200)

source: https://jakevdp.github.io/PythonDataScienceHandbook/00.00-preface.html

This diagram shows where data science lies at the intersection of different disciplines. It combines the skills of statistics (knowing how to summarize datasets), computer science (knowing how to design and implement algorithms) and domain expertise (necessary to formulate the research questions and put the answers into context).

### Where is Data Science needed?

It's used in many disciplines, such as science, banking, consulting, health care, e-commerce, politics, and manufacturing.


Some areas where data science is used:
* marketing (sales analysis)
* election predictions
* to predict the best time for deliveries
* to foresee delays in traffic
* etc

In the end, for our purposes, data science can help us better understand our data and provide some basic statistics as well as find out the relations inside the data.

Some DS tasks that are also useful for us:
* asking the right questions
* exploring and collecting data
* extracting data
* cleaning the data
* finding and replacing missing values
* normalizing data
* analyzing data, finding patterns and making predictions
* representing the results (visualisations)

### Data 

Types of data:
* unstructured (Tweets, novels, reviews)
* structured (databases, tables)

One purpose of Data Science is to structure data, making it interpretable and easy to work with.
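
As a minimal sketch of what "structuring" can mean, the example below turns a few free-text reviews into a table with one row per review and two derived columns. The review texts and the names `reviews_df`, `n_words` and `n_chars` are made up for illustration.

```python
import pandas as pd

# Hypothetical unstructured input: free-text reviews
reviews = [
    "Great show, loved the plot twists",
    "Too slow for my taste",
    "Brilliant acting",
]

# Structure it: one row per review, plus derived numeric columns
reviews_df = pd.DataFrame({"text": reviews})
reviews_df["n_words"] = reviews_df["text"].str.split().str.len()
reviews_df["n_chars"] = reviews_df["text"].str.len()

print(reviews_df)
```

Once the text is in a table, the numeric columns can be sorted, filtered and summarized like any other structured data.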



### Data analysis in Python

Python is often called the "main language" of data science, as it has many built-in functions as well as libraries developed specifically for data science.

Some of these libraries:
* [pandas](https://pandas.pydata.org/) - for structured data operations
* [numpy](https://numpy.org/) - a mathematical library (linear algebra, Fourier transform)
* [matplotlib](https://matplotlib.org/) - visualizations
* [SciPy](https://scipy.org/) - scientific computing (optimization, integration, statistics, linear algebra)

### Data analysis and manipulation with Pandas

What can we find out with pandas?

* what is the correlation between two or more columns?
* what is the average value?
* what are the min and max values?
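
The questions above can be sketched on a small, made-up table. The data and the name `scores` are hypothetical; `corr`, `mean`, `min`, `max` and `describe` are standard pandas methods.

```python
import pandas as pd

# Hypothetical toy data: hours studied and exam score for five students
scores = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 70, 80],
})

# Pearson correlation between two columns
corr = scores["hours"].corr(scores["score"])

# Average, minimum and maximum of one column
mean_score = scores["score"].mean()   # 65.0
min_score = scores["score"].min()     # 52
max_score = scores["score"].max()     # 80

print(round(corr, 3), mean_score, min_score, max_score)

# describe() summarizes every numeric column in one call
print(scores.describe())
```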

To install pandas - https://pandas.pydata.org/docs/getting_started/install.html

In [3]:
import pandas

Let's check our pandas version:

In [4]:
pandas.__version__

'1.4.4'

By convention, pandas is imported under the abbreviation 'pd', so the import usually looks like this:

In [5]:
import pandas as pd

For data science, data types are extremely important. We are already familiar with the standard Python data types, such as strings, ints, lists and dictionaries. In pandas, the two most important data types (or objects) are called Series and DataFrame.

### Pandas Series

A pandas Series is like a column in a table: a one-dimensional array that can hold any type of data. We can create one from a list, e.g.:

In [6]:
a = [1, 2, 3]
my_series = pd.Series(a)
my_series

0    1
1    2
2    3
dtype: int64

If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, etc.

This label can be used to access a specified value.



In [7]:
my_series[0]

1

With the index argument, you can name your own labels.



In [8]:
my_series = pd.Series(a, index = ["x", "y", "z"])

my_series

x    1
y    2
z    3
dtype: int64

When you have created labels, you can access an item by referring to the label.



In [9]:
print(my_series['x'])

1


We can think of a Series as a specialized dictionary. A dictionary maps keys to values, and a Series likewise maps typed keys (the index labels) to typed values.
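
A short sketch of this analogy, using made-up data (the name `prices` and its values are hypothetical): the usual dictionary operations work on a Series, and on top of that a Series supports arithmetic on all values at once.

```python
import pandas as pd

# Hypothetical example data
prices = pd.Series({"apple": 0.5, "banana": 0.25, "cherry": 3.0})

# Dictionary-style operations
print("banana" in prices)       # membership test on the labels
print(list(prices.keys()))      # the labels, like dict keys
print(list(prices.items()))     # (label, value) pairs

# Unlike a dict, a Series supports vectorized arithmetic
doubled = prices * 2
print(doubled["cherry"])        # 6.0
```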


In [10]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}



<div class="alert alert-block alert-info">
<b>Exercise 1</b>
    
<p>
    <li>Make a Series object out of population_dict </li>
    <li>find out the population of California </li>
 
</p>
</div>

In [11]:
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [12]:
population['California']

38332521

We can also slice the data:

In [13]:
population['Texas':'Illinois']

Texas       26448193
New York    19651127
Florida     19552860
Illinois    12882135
dtype: int64
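
Note that the label-based slice above includes its end label ('Illinois'). This differs from ordinary positional slicing, where the stop position is excluded. A minimal sketch with a made-up Series `s`:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s["b":"c"]       # label slice: both "b" and "c" are included
by_position = s.iloc[1:2]   # position slice: the stop position is excluded

print(list(by_label))       # [20, 30]
print(list(by_position))    # [20]
```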

### Pandas Dataframes

DataFrames are structured representations of data and, unlike Series, often have more than one column.

Let's start by making a DataFrame with some numbers:

In [14]:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11]}

In [15]:
df = pd.DataFrame(data=d, index = ['number1', 'number2', 'number3', 'number4', 'number5'])
df

Unnamed: 0,col1,col2,col3
number1,1,4,7
number2,2,5,8
number3,3,6,12
number4,4,9,1
number5,7,5,11


We see that "col1", "col2" and "col3" are the names of the columns. The row labels (the index) come from the list we passed as the index argument; they identify the rows.

Now we can use ```df.shape``` to count the rows and columns. It returns a tuple of the form (rows, columns), so ```df.shape[0]``` is the number of rows:



In [16]:
count_row = df.shape[0]

In [17]:
print(count_row)

5
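
As a small sketch, both counts can be read from the shape tuple at once. The frame `df_small` is a hypothetical stand-in so that the example runs on its own.

```python
import pandas as pd

df_small = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = df_small.shape
print(n_rows, n_cols)            # 3 2

# Equivalent ways to get each count
print(len(df_small))             # number of rows
print(len(df_small.columns))     # number of columns
```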


Next, let's build a DataFrame from the state data. We first construct a dictionary listing the area of each of the five states from the Series section:



In [18]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}


Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:



In [19]:
states = pd.DataFrame({'population': population,
                       'area': area_dict})
states

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


### Indexing 

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:



In [20]:
states['area']


California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [21]:
states.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
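
Attribute-style access such as `states.area` is convenient, but it does not work for every column name. The sketch below uses a made-up frame (`frame` and its column names are hypothetical) to show two cases where only the dictionary-style form returns the column.

```python
import pandas as pd

# Hypothetical frame with awkward column names
frame = pd.DataFrame({"pop": [1, 2], "median age": [30, 40]})

# "pop" is also a DataFrame method, so attribute access returns the method
print(callable(frame.pop))       # True: this is DataFrame.pop, not the column
print(frame["pop"].tolist())     # [1, 2]

# Names containing spaces cannot be written as attributes at all
print(frame["median age"].tolist())
```

For this reason the dictionary-style form is the safer default.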

Like with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:



In [22]:
states['density'] = states['population'] / states['area']
states

Unnamed: 0,population,area,density
California,38332521,423967,90.413926
Texas,26448193,695662,38.01874
New York,19651127,141297,139.076746
Florida,19552860,170312,114.806121
Illinois,12882135,149995,85.883763





We can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:



In [23]:
states.values


array([[3.83325210e+07, 4.23967000e+05, 9.04139261e+01],
       [2.64481930e+07, 6.95662000e+05, 3.80187404e+01],
       [1.96511270e+07, 1.41297000e+05, 1.39076746e+02],
       [1.95528600e+07, 1.70312000e+05, 1.14806121e+02],
       [1.28821350e+07, 1.49995000e+05, 8.58837628e+01]])
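
All numbers in the array above are floats, even though population and area are integer columns: a NumPy array has a single dtype, so mixed columns are converted to a common one. A minimal sketch with a made-up frame `mixed`; it uses `to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import pandas as pd

# Hypothetical frame mixing an integer and a float column
mixed = pd.DataFrame({"count": [1, 2], "ratio": [0.5, 1.5]})

print(mixed.dtypes)        # one dtype per column: int64 and float64

arr = mixed.to_numpy()     # same data as mixed.values
print(arr.dtype)           # float64: one dtype for the whole array
print(arr.shape)           # (2, 2)
```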

With this picture in mind, many familiar array-like operations can be performed on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:



In [24]:
states.T

Unnamed: 0,California,Texas,New York,Florida,Illinois
population,38332520.0,26448190.0,19651130.0,19552860.0,12882140.0
area,423967.0,695662.0,141297.0,170312.0,149995.0
density,90.41393,38.01874,139.0767,114.8061,85.88376


For array-style indexing, pandas provides the loc and iloc indexers. Using the iloc indexer, we can index the DataFrame as if it were a plain array, but the DataFrame index and column labels are maintained in the result.
Unlike loc, iloc is integer-position based: it only works with row and column positions, numbered from 0 to n-1.



In [25]:
states.iloc[:3, :2]


Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297


In [26]:
states.loc[:'Florida', :'population']


Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860


Similarly, using the loc indexer we can index the underlying data in an array-like style but using the explicit index and column names:



In [27]:
print(states.loc['California'])


population    3.833252e+07
area          4.239670e+05
density       9.041393e+01
Name: California, dtype: float64
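
The difference between loc and iloc is easiest to see when the labels are themselves integers. A minimal sketch with a made-up Series `s` whose labels are 1, 3 and 5:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"], index=[1, 3, 5])

print(s.loc[1])     # label 1    -> "a"
print(s.iloc[1])    # position 1 -> "b"

label_slice = list(s.loc[1:3])       # labels 1 to 3, end included
position_slice = list(s.iloc[1:3])   # positions 1 and 2, end excluded

print(label_slice)      # ['a', 'b']
print(position_slice)   # ['b', 'c']
```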


Direct masking operations are also interpreted row-wise rather than column-wise:



In [28]:
states[states.density > 100]


Unnamed: 0,population,area,density
New York,19651127,141297,139.076746
Florida,19552860,170312,114.806121
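
Masks can also be combined, and a mask can be used inside loc to pick columns at the same time. The sketch below rebuilds a three-state version of the table so that it runs on its own; the name `states_small` and the thresholds are chosen for illustration.

```python
import pandas as pd

# Three of the states from above, with the same values
states_small = pd.DataFrame(
    {"population": [38332521, 26448193, 19651127],
     "area": [423967, 695662, 141297]},
    index=["California", "Texas", "New York"],
)
states_small["density"] = states_small["population"] / states_small["area"]

# Combine conditions with & (and) or | (or); each condition needs parentheses
dense_and_big = states_small[
    (states_small["density"] > 50) & (states_small["population"] > 20_000_000)
]
print(dense_and_big)

# A mask inside loc selects rows, and a list of names selects columns
dense_cols = states_small.loc[states_small["density"] > 100, ["population", "density"]]
print(dense_cols)
```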


<div class="alert alert-block alert-info">
<b>Exercise 2</b>
    
<p>
    <li>Make a new dataframe out of exercise_data  </li>
    <li>Give the rows indexes as follows: 'day1', 'day2', 'day3' </li>
    <li>Return 'day3' using loc </li>
</p>
</div>

In [29]:
exercise_data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}


In [30]:
new = pd.DataFrame(data=exercise_data, index=['day1', 'day2', 'day3'])
new

Unnamed: 0,calories,duration
day1,420,50
day2,380,40
day3,390,45


In [31]:
print(new.loc['day3'])

calories    390
duration     45
Name: day3, dtype: int64


### Creating a dataframe from csv

This is functionality that we use a lot, for reading in our own datasets.

In [32]:
df = pd.read_csv('../data/koreanTV.csv')
df

Unnamed: 0,Title,Year,Rating,Votes:,Time,Genre,Stars,Short Story
0,Hellbound,(2021– ),6.7,14032,150 min,"Crime, Drama, Fantasy","Yoo Ah-in, Kim Hyun-joo, Jeong Min Park, Jin-a...",\nPeople hear predictions on when they will di...
1,Squid Game,(2021– ),8.1,339931,55 min,"Action, Drama, Mystery","Lee Jung-jae, Park Hae-soo, Wi Ha-Joon, Jung H...",\nHundreds of cash-strapped players accept a s...
2,My Name,(2021– ),7.9,12877,50 min,"Action, Crime, Drama","Hee-soon Park, Ahn Bo-Hyun, Han So-hee, Kim Sa...",\nThe story about a woman who joins an organiz...
3,Miraculous: Tales of Ladybug & Cat Noir,(2015– ),7.7,9439,20 min,"Animation, Action, Adventure","Cristina Valenzuela, Bryce Papenbrook, Keith S...","\nMarinette and Adrien, two normal teens, tran..."
4,Happiness,(2021– ),8.6,921,-,"Action, Fantasy, Thriller","Han Hyo-joo, Park Hyung-Sik, Woo-jin Jo, Lee J...",\nAn apocalyptic thriller that takes place in ...
...,...,...,...,...,...,...,...,...
1984,Untitled K-Pop Latin American Project,,-,-,-,Reality-TV,-,\nYoung men from Latin America audition for an...
1985,Bite Sisters,(2021),7.4,5,-,"Fantasy, Romance","Kang Han-na, Kim Yeong-Ah, Yu-hwa Choi, Lee Si...",\nThis story follows Han Yi Na a vampire who s...
1986,Adult Trainee,(2021– ),7.1,10,-,"Comedy, Romance","Mi-Yeon Cho, Ryu Eui-Hyun, Lee Chan Hyung, Yoo...",\nAdd a Plot\n
1987,A good supper,(2021),-,-,-,Romance,-,\nAdd a Plot\n


In [33]:
df.shape


(1989, 8)
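
In the table above, missing entries appear as '-'. One possible way to handle such placeholders is the `na_values` parameter of `read_csv`, sketched here on a small in-memory CSV. The CSV text and the name `shows` are made up; whether this setting suits the real file is an assumption that would need checking.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Title,Rating,Votes
Show A,8.1,1200
Show B,-,-
Show C,7.4,35
"""

# na_values lists the strings that should be read as missing values
shows = pd.read_csv(io.StringIO(csv_text), na_values=["-"])

print(shows.shape)                     # (3, 3)
print(shows.dtypes)                    # Rating and Votes are numeric now
print(shows["Rating"].isna().sum())    # 1 missing rating
```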

We can also preview the DataFrame using the head function. By default, head displays the first five rows of the DataFrame.



In [34]:
df.head()

Unnamed: 0,Title,Year,Rating,Votes:,Time,Genre,Stars,Short Story
0,Hellbound,(2021– ),6.7,14032,150 min,"Crime, Drama, Fantasy","Yoo Ah-in, Kim Hyun-joo, Jeong Min Park, Jin-a...",\nPeople hear predictions on when they will di...
1,Squid Game,(2021– ),8.1,339931,55 min,"Action, Drama, Mystery","Lee Jung-jae, Park Hae-soo, Wi Ha-Joon, Jung H...",\nHundreds of cash-strapped players accept a s...
2,My Name,(2021– ),7.9,12877,50 min,"Action, Crime, Drama","Hee-soon Park, Ahn Bo-Hyun, Han So-hee, Kim Sa...",\nThe story about a woman who joins an organiz...
3,Miraculous: Tales of Ladybug & Cat Noir,(2015– ),7.7,9439,20 min,"Animation, Action, Adventure","Cristina Valenzuela, Bryce Papenbrook, Keith S...","\nMarinette and Adrien, two normal teens, tran..."
4,Happiness,(2021– ),8.6,921,-,"Action, Fantasy, Thriller","Han Hyo-joo, Park Hyung-Sik, Woo-jin Jo, Lee J...",\nAn apocalyptic thriller that takes place in ...


Similarly, we can use the tail function to display the last five rows.



In [35]:
df.tail()


Unnamed: 0,Title,Year,Rating,Votes:,Time,Genre,Stars,Short Story
1984,Untitled K-Pop Latin American Project,,-,-,-,Reality-TV,-,\nYoung men from Latin America audition for an...
1985,Bite Sisters,(2021),7.4,5,-,"Fantasy, Romance","Kang Han-na, Kim Yeong-Ah, Yu-hwa Choi, Lee Si...",\nThis story follows Han Yi Na a vampire who s...
1986,Adult Trainee,(2021– ),7.1,10,-,"Comedy, Romance","Mi-Yeon Cho, Ryu Eui-Hyun, Lee Chan Hyung, Yoo...",\nAdd a Plot\n
1987,A good supper,(2021),-,-,-,Romance,-,\nAdd a Plot\n
1988,User Not Found,(2021– ),-,-,-,"Drama, Romance",-,\nAdd a Plot\n
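
Both functions accept the number of rows as an optional argument. A minimal sketch on a made-up frame `numbers`:

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})

first_three = numbers.head(3)   # first three rows
last_two = numbers.tail(2)      # last two rows

print(first_three["n"].tolist())   # [0, 1, 2]
print(last_two["n"].tolist())      # [8, 9]
```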


### Basic analysis of our data

Very often, not all the data in a pandas DataFrame are relevant to our work. Fortunately, a pandas DataFrame allows us to selectively extract one or more columns of data and work with them.



We can use a column name as the key to extract the data from that particular column.



In [36]:
df['Title'].head()


0                                  Hellbound
1                                 Squid Game
2                                    My Name
3    Miraculous: Tales of Ladybug & Cat Noir
4                                  Happiness
Name: Title, dtype: object

We can also select multiple columns by putting the target column names in a list.



In [37]:
df[['Title', 'Rating']].head()


Unnamed: 0,Title,Rating
0,Hellbound,6.7
1,Squid Game,8.1
2,My Name,7.9
3,Miraculous: Tales of Ladybug & Cat Noir,7.7
4,Happiness,8.6


If we really do not need a column, we can always remove it. For that we use ```drop```:

In [38]:
df = df.drop("Short Story", axis=1)
df.head()

Unnamed: 0,Title,Year,Rating,Votes:,Time,Genre,Stars
0,Hellbound,(2021– ),6.7,14032,150 min,"Crime, Drama, Fantasy","Yoo Ah-in, Kim Hyun-joo, Jeong Min Park, Jin-a..."
1,Squid Game,(2021– ),8.1,339931,55 min,"Action, Drama, Mystery","Lee Jung-jae, Park Hae-soo, Wi Ha-Joon, Jung H..."
2,My Name,(2021– ),7.9,12877,50 min,"Action, Crime, Drama","Hee-soon Park, Ahn Bo-Hyun, Han So-hee, Kim Sa..."
3,Miraculous: Tales of Ladybug & Cat Noir,(2015– ),7.7,9439,20 min,"Animation, Action, Adventure","Cristina Valenzuela, Bryce Papenbrook, Keith S..."
4,Happiness,(2021– ),8.6,921,-,"Action, Fantasy, Thriller","Han Hyo-joo, Park Hyung-Sik, Woo-jin Jo, Lee J..."


Here we used the drop function to remove the column “Short Story” from the DataFrame and then showed the first five rows. The “axis=1” argument specifies that we are removing a column instead of a row.

Now let's use the rows to navigate the DataFrame. We will first try the loc method, which allows us to select a row from a DataFrame using an index label. For example, we can set the movie title as the index of the DataFrame and use it to extract our target rows.
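
A short sketch of two related points, using a made-up frame `table`: the `columns=` keyword is an alternative to `axis=1`, and drop returns a new DataFrame rather than changing the original, which is why the result was assigned back to `df` above.

```python
import pandas as pd

table = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

without_b = table.drop(columns="b")            # same as table.drop("b", axis=1)
without_bc = table.drop(columns=["b", "c"])    # several columns at once

print(list(without_b.columns))     # ['a', 'c']
print(list(without_bc.columns))    # ['a']
print(list(table.columns))         # ['a', 'b', 'c'], the original is unchanged
```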



In [39]:
df = df.set_index('Title')
df.head()


Title,Year,Rating,Votes:,Time,Genre,Stars
Hellbound,(2021– ),6.7,14032,150 min,"Crime, Drama, Fantasy","Yoo Ah-in, Kim Hyun-joo, Jeong Min Park, Jin-a..."
Squid Game,(2021– ),8.1,339931,55 min,"Action, Drama, Mystery","Lee Jung-jae, Park Hae-soo, Wi Ha-Joon, Jung H..."
My Name,(2021– ),7.9,12877,50 min,"Action, Crime, Drama","Hee-soon Park, Ahn Bo-Hyun, Han So-hee, Kim Sa..."
Miraculous: Tales of Ladybug & Cat Noir,(2015– ),7.7,9439,20 min,"Animation, Action, Adventure","Cristina Valenzuela, Bryce Papenbrook, Keith S..."
Happiness,(2021– ),8.6,921,-,"Action, Fantasy, Thriller","Han Hyo-joo, Park Hyung-Sik, Woo-jin Jo, Lee J..."


In [40]:
df.shape

(1989, 6)

“Title” is removed from the column list of the DataFrame and now serves as the index label.
The column count confirms this: of the original 8 columns, we dropped one and turned one into the index, so 6 remain.
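
Setting an index can be undone with reset_index, which moves the index back into an ordinary column. A minimal sketch on a made-up frame (the names `shows`, `indexed` and `restored` and the values are hypothetical):

```python
import pandas as pd

shows = pd.DataFrame({"Title": ["Show A", "Show B"], "Rating": [8.1, 7.4]})

indexed = shows.set_index("Title")
print(indexed.shape)                      # (2, 1): Title is no longer a column
print(indexed.loc["Show B", "Rating"])    # 7.4

restored = indexed.reset_index()
print(list(restored.columns))             # ['Title', 'Rating']
```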

We can make use of this movie name as an index label to extract the data from the DataFrame by row.



In [41]:
df.loc['Squid Game']

Year                                               (2021– )
Rating                                                  8.1
Votes:                                              339,931
Time                                                 55 min
Genre                                Action, Drama, Mystery
Stars     Lee Jung-jae, Park Hae-soo, Wi Ha-Joon, Jung H...
Name: Squid Game, dtype: object

In [42]:
df.loc[['Squid Game', 'Happiness']]

Title,Year,Rating,Votes:,Time,Genre,Stars
Squid Game,(2021– ),8.1,339931,55 min,"Action, Drama, Mystery","Lee Jung-jae, Park Hae-soo, Wi Ha-Joon, Jung H..."
Happiness,(2021– ),8.6,921,-,"Action, Fantasy, Thriller","Han Hyo-joo, Park Hyung-Sik, Woo-jin Jo, Lee J..."


The second approach to working with a pandas DataFrame by row is the iloc method, which selects rows by integer position.


In [43]:
df.iloc[3]

Year                                               (2015– )
Rating                                                  7.7
Votes:                                                9,439
Time                                                 20 min
Genre                          Animation, Action, Adventure
Stars     Cristina Valenzuela, Bryce Papenbrook, Keith S...
Name: Miraculous: Tales of Ladybug & Cat Noir, dtype: object

In [44]:
df.iloc[2:5]

Title,Year,Rating,Votes:,Time,Genre,Stars
My Name,(2021– ),7.9,12877,50 min,"Action, Crime, Drama","Hee-soon Park, Ahn Bo-Hyun, Han So-hee, Kim Sa..."
Miraculous: Tales of Ladybug & Cat Noir,(2015– ),7.7,9439,20 min,"Animation, Action, Adventure","Cristina Valenzuela, Bryce Papenbrook, Keith S..."
Happiness,(2021– ),8.6,921,-,"Action, Fantasy, Thriller","Han Hyo-joo, Park Hyung-Sik, Woo-jin Jo, Lee J..."


Let's now remove a row. As said before, we can use a movie name as index label to remove a particular row. In this case, the row with movie name, “My Name” will be removed from the DataFrame.


In [45]:
df = df.drop('My Name', axis=0)
df.head()

Title,Year,Rating,Votes:,Time,Genre,Stars
Hellbound,(2021– ),6.7,14032,150 min,"Crime, Drama, Fantasy","Yoo Ah-in, Kim Hyun-joo, Jeong Min Park, Jin-a..."
Squid Game,(2021– ),8.1,339931,55 min,"Action, Drama, Mystery","Lee Jung-jae, Park Hae-soo, Wi Ha-Joon, Jung H..."
Miraculous: Tales of Ladybug & Cat Noir,(2015– ),7.7,9439,20 min,"Animation, Action, Adventure","Cristina Valenzuela, Bryce Papenbrook, Keith S..."
Happiness,(2021– ),8.6,921,-,"Action, Fantasy, Thriller","Han Hyo-joo, Park Hyung-Sik, Woo-jin Jo, Lee J..."
Dr. Brain,(2021– ),6.9,1031,60 min,"Drama, Mystery, Sci-Fi","Sun-kyun Lee, June Yoon, Yoo-Young Lee, Hee-so..."


In [46]:
df.shape

(1988, 6)
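
Here the shape shrank by exactly one row. That is not guaranteed in general: drop removes every row that carries the given label, and it also accepts a list of labels. A minimal sketch with a made-up frame `ratings` that contains a duplicated label:

```python
import pandas as pd

ratings = pd.DataFrame(
    {"Rating": [8.1, 7.9, 6.5, 7.0]},
    index=["Show A", "Show B", "Show B", "Show C"],
)

# axis=0 (rows) is the default; both "Show B" rows are removed
trimmed = ratings.drop(["Show B", "Show C"])
print(trimmed.shape)     # (1, 1)

# errors="ignore" skips labels that do not exist instead of raising KeyError
same = ratings.drop("No Such Show", errors="ignore")
print(same.shape)        # (4, 1)
```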

<div class="alert alert-block alert-info">
<b>Exercise 3</b>
    
<p>
    <li>Import housing.csv into a dataframe  </li>
    <li>Preview the head of the data </li>
    <li>Remove the latitude column </li>
    <li>Set a condition to select data with the housing median age above 50</li>
</p>
</div>

### References and further reading:
    
 * Python Data Science Handbook - https://jakevdp.github.io/PythonDataScienceHandbook/
 * https://www.w3schools.com/python/pandas/default.asp
 * https://www.geeksforgeeks.org/data-science-tutorial/
 * data taken from - https://github.com/teobeeguan/Python-For-Machine-Learning/tree/main/Pandas
 * tutorial -  https://medium.com/ds-notes/learning-python-pandas-in-minutes-part-1-basics-f24463da1a18
 

In [47]:
df = pd.read_csv('../data/housing.csv')

df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


In [48]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [49]:
df = df.drop("latitude", axis=1)

df

Unnamed: 0,longitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...
20635,-121.09,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
20636,-121.21,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND
20637,-121.22,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND
20638,-121.32,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND


In [50]:
df[df.housing_median_age > 50]

Unnamed: 0,longitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
2,-122.24,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY
5,-122.25,52.0,919.0,213.0,413.0,193.0,4.0368,269700.0,NEAR BAY
6,-122.25,52.0,2535.0,489.0,1094.0,514.0,3.6591,299200.0,NEAR BAY
...,...,...,...,...,...,...,...,...,...
20142,-119.06,52.0,1239.0,320.0,934.0,298.0,1.8618,183300.0,<1H OCEAN
20220,-119.27,52.0,2239.0,420.0,941.0,397.0,4.1250,349000.0,NEAR OCEAN
20236,-119.27,52.0,459.0,112.0,276.0,107.0,2.3750,198400.0,NEAR OCEAN
20237,-119.27,52.0,1577.0,343.0,836.0,335.0,3.5893,206600.0,NEAR OCEAN
