# It is standard practice to import pandas under the alias pd, which is shorter to type later on.

In [1]:
import pandas as pd

# We will be working with a dataset of Titanic passengers. For each passenger, we have some data about them as well as whether or not they survived the sinking.

In [2]:
df = pd.read_csv('https://sololearn.com/uploads/files/titanic.csv') # read_csv takes a file in CSV format (here, a URL) and converts it to a pandas DataFrame

In [3]:
df.head() # The head method returns the first 5 rows of the DataFrame.

,Survived,Pclass,Sex,Age,Siblings/Spouses,Parents/Children,Fare
0,0,3,male,22.0,1,0,7.25
1,1,1,female,38.0,1,0,71.2833
2,1,3,female,26.0,0,0,7.925
3,1,1,female,35.0,1,0,53.1
4,0,3,male,35.0,0,0,8.05
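
As a small aside, head also accepts the number of rows to return, and tail does the same for the end of the DataFrame. The sketch below uses a hypothetical five-row stand-in (`sample`) built from the rows printed above, so that it runs without downloading the file; it is not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in: the five rows shown above, with two columns omitted for brevity.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

first_two = sample.head(2)  # head accepts the number of rows to return
last_two = sample.tail(2)   # tail is the counterpart for the end of the DataFrame

print(first_two)
print(last_two)
print(sample.shape)         # (number of rows, number of columns)
```

Checking shape alongside head is a quick way to see how much data there is before printing any of it.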


Usually our data is much too big for us to be able to display it all.
Looking at the first few rows is the first step to understanding our data, but then we want to look at some summary statistics.
In pandas, we can use the describe method. It returns a table of summary statistics for the numeric columns (the text column Sex is left out).

In [5]:
df.describe() # returns a table of summary statistics for the numeric columns

,Survived,Pclass,Age,Siblings/Spouses,Parents/Children,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292


We add a line in the code below so that pandas displays all 6 columns. Without it, pandas may abbreviate wide output by replacing the middle columns with an ellipsis.

In [8]:
pd.options.display.max_columns = 6 # raise the display limit so that pandas shows all 6 columns
df.describe() # We use the Pandas describe method to start building some intuition about our data.

,Survived,Pclass,Age,Siblings/Spouses,Parents/Children,Fare
count,887.0,887.0,887.0,887.0,887.0,887.0
mean,0.385569,2.305524,29.471443,0.525366,0.383315,32.30542
std,0.487004,0.836662,14.121908,1.104669,0.807466,49.78204
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.25,0.0,0.0,7.925
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.1375
max,1.0,3.0,80.0,8.0,6.0,512.3292
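
To see what the column limit actually does, here is a minimal sketch using a hypothetical ten-column DataFrame (`wide`, not part of the Titanic data). It uses `pd.option_context`, which changes a display option only inside the `with` block, as an alternative to assigning to `pd.options.display.max_columns` globally.

```python
import pandas as pd

# Hypothetical wide DataFrame, used only to make truncation visible.
wide = pd.DataFrame({f"col{i}": [i, i + 1] for i in range(10)})

default_limit = pd.get_option("display.max_columns")

# With a limit of 4, the repr keeps the outer columns and elides the middle ones.
with pd.option_context("display.max_columns", 4):
    truncated = repr(wide)

# None removes the limit; a generous width keeps the columns on one line.
with pd.option_context("display.max_columns", None, "display.width", 200):
    full = repr(wide)

print(truncated)
print(full)

# option_context restores the previous setting when the block exits.
restored = pd.get_option("display.max_columns") == default_limit
print(restored)
```

Setting the option to None is handy when we do not want to count the columns by hand each time the DataFrame grows.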


Count: The number of rows that have a value. In our case, every passenger has a value for each of the columns, so the count is 887 (the total number of passengers).

Mean: The arithmetic average of the values.

Std: Short for standard deviation, a measure of how dispersed the data is.

Min: The smallest value.

25%: The 25th percentile, the value below which 25% of the data falls.

50%: The 50th percentile, also known as the median.

75%: The 75th percentile.

Max: The largest value.
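
Each row of the describe table can also be computed on its own. The sketch below does this for a hypothetical five-value Series (`fares`, taken from the first five fares shown earlier) and compares the results with what describe reports.

```python
import pandas as pd

# Hypothetical stand-in: the first five fares from the dataset.
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05], name="Fare")

summary = fares.describe()

print(fares.count())         # number of non-missing values
print(fares.mean())          # arithmetic average
print(fares.std())           # sample standard deviation
print(fares.min())           # smallest value
print(fares.quantile(0.25))  # 25th percentile
print(fares.median())        # 50th percentile
print(fares.quantile(0.75))  # 75th percentile
print(fares.max())           # largest value

# The individual methods agree with the corresponding rows of describe().
print(summary["50%"] == fares.median())
```

The individual methods are useful when we only need one statistic and do not want the whole table.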

To select a single column, we use square brackets and the column name.

In [10]:
print(df['Fare'])

0       7.2500
1      71.2833
2       7.9250
3      53.1000
4       8.0500
        ...   
882    13.0000
883    30.0000
884    23.4500
885    30.0000
886     7.7500
Name: Fare, Length: 887, dtype: float64


The result is what we call a Pandas Series.
A Series is like a DataFrame, but with just a single column.
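
The difference between the two types can be checked directly. In this minimal sketch, `sample` is a hypothetical three-row stand-in for the Titanic DataFrame: single brackets give a Series, while double brackets (a list with one name) give a one-column DataFrame.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic DataFrame.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
})

fare_series = sample["Fare"]    # single brackets: a Series
fare_frame = sample[["Fare"]]   # double brackets: a one-column DataFrame

print(type(fare_series).__name__)  # Series
print(type(fare_frame).__name__)   # DataFrame
print(fare_series.shape)           # (3,)
print(fare_frame.shape)            # (3, 1)
```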

We can also select multiple columns from our original DataFrame, creating a smaller DataFrame.
We're going to select just the Age, Fare, and Survived columns from our original DataFrame.

In [11]:
small_df = df[['Age', 'Fare', 'Survived']]

In [12]:
print(small_df)

      Age     Fare  Survived
0    22.0   7.2500         0
1    38.0  71.2833         1
2    26.0   7.9250         1
3    35.0  53.1000         1
4    35.0   8.0500         0
..    ...      ...       ...
882  27.0  13.0000         0
883  19.0  30.0000         1
884   7.0  23.4500         0
885  26.0  30.0000         1
886  32.0   7.7500         0

[887 rows x 3 columns]


Alternatively, we can store the column names in a list first and pass that list to the brackets:

In [13]:
columns = ['Age', 'Fare', 'Survived']

In [14]:
list_small_df = df[columns]
print(list_small_df) # use that list inside of the bracket notation df[...]

      Age     Fare  Survived
0    22.0   7.2500         0
1    38.0  71.2833         1
2    26.0   7.9250         1
3    35.0  53.1000         1
4    35.0   8.0500         0
..    ...      ...       ...
882  27.0  13.0000         0
883  19.0  30.0000         1
884   7.0  23.4500         0
885  26.0  30.0000         1
886  32.0   7.7500         0

[887 rows x 3 columns]
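
Two details of list-based selection are worth a quick check: the resulting columns follow the order of the list, and a name that does not exist raises a KeyError. The sketch below assumes a hypothetical three-row stand-in (`sample`); the column name `Cabin` is deliberately one that is not in it.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic DataFrame.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Survived": [0, 1, 1],
})

columns = ["Survived", "Age"]
subset = sample[columns]        # columns come back in the order given in the list
print(list(subset.columns))     # ['Survived', 'Age']

# Asking for a column that does not exist raises a KeyError.
missing_raised = False
try:
    sample[["Age", "Cabin"]]
except KeyError as err:
    missing_raised = True
    print("KeyError:", err)
```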


In [15]:
list_small_df.head() # use the head method to print just the first 5 rows.

,Age,Fare,Survived
0,22.0,7.25,0
1,38.0,71.2833,1
2,26.0,7.925,1
3,35.0,53.1,1
4,35.0,8.05,0


We often want our data in a slightly different format than it originally comes in. For example, our data has the sex of the passenger as a string ("male" or "female"). This is easy for a human to read, but when we do computations on our data later on, we’ll want it as boolean values (Trues and Falses).

We can easily create a new column in our DataFrame that is True if the passenger is male and False if they’re female.

In [17]:
df['Sex'] == 'male'

0       True
1      False
2      False
3      False
4       True
       ...  
882     True
883    False
884    False
885     True
886     True
Name: Sex, Length: 887, dtype: bool

In [18]:
# To create a new column, we use the same bracket syntax (df['male']) and assign the new value to it.
df['male'] = df['Sex'] == 'male'

In [21]:
pd.options.display.max_columns = 8 # the DataFrame now has 8 columns, so raise the display limit to match
df.head()

,Survived,Pclass,Sex,Age,Siblings/Spouses,Parents/Children,Fare,male
0,0,3,male,22.0,1,0,7.25,True
1,1,1,female,38.0,1,0,71.2833,False
2,1,3,female,26.0,0,0,7.925,False
3,1,1,female,35.0,1,0,53.1,False
4,0,3,male,35.0,0,0,8.05,True
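
One reason a boolean column is convenient for computation is that True counts as 1 and False as 0. The sketch below illustrates this on a hypothetical five-row stand-in (`sample`) that mirrors the Sex values printed above; the `male_int` column is an illustrative addition, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in: the Sex values of the first five passengers.
sample = pd.DataFrame({"Sex": ["male", "female", "female", "female", "male"]})

sample["male"] = sample["Sex"] == "male"

male_count = sample["male"].sum()    # True counts as 1, so this is the number of males
male_share = sample["male"].mean()   # fraction of rows where the value is True
print(male_count)  # 2
print(male_share)  # 0.4

# If 0/1 integers are needed instead of booleans, astype converts them.
sample["male_int"] = sample["male"].astype(int)
print(sample)
```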
