# Looking at DataFrame Data

1. Run the cell below to import the required libraries and create a DataFrame.

In [1]:
import pandas as pd
import numpy as np
import random

num_rows = 100
colors = ['Red', 'Blue', 'Green']

df = pd.DataFrame( {'color': [colors[random.randint(0,2)] for _ in range(num_rows)],
                    'integers': [random.randint(0,15) for _ in range(num_rows)],
                    'floats': [random.random() for _ in range(num_rows)]})
df

Unnamed: 0,color,integers,floats
0,Green,2,0.599703
1,Blue,6,0.265175
2,Green,12,0.991561
3,Green,10,0.249424
4,Red,15,0.880343
...,...,...,...
95,Blue,5,0.946903
96,Green,3,0.839961
97,Red,2,0.221419
98,Red,9,0.761552
99,Blue,9,0.708124


2. Use the DataFrame `head()` method to view the first five rows. Try giving it a number as an argument to control how many rows are displayed. The `tail()` method works the same way for the last rows.

In [2]:
df.head(3)

Unnamed: 0,color,integers,floats
0,Green,2,0.599703
1,Blue,6,0.265175
2,Green,12,0.991561


In [4]:
df.tail(3)

Unnamed: 0,color,integers,floats
97,Red,2,0.221419
98,Red,9,0.761552
99,Blue,9,0.708124
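
To make the effect of the argument easier to see, here is a minimal sketch on a small, deterministic frame. The name `demo` and its contents are hypothetical and stand in for the random DataFrame above, so the row counts are predictable.

```python
import pandas as pd

# Hypothetical ten-row frame, used only for illustration
demo = pd.DataFrame({'n': range(10)})

default_head = demo.head()           # no argument: the first five rows
last_two = demo.tail(2)              # the last two rows
all_but_last_three = demo.head(-3)   # negative n: every row except the last three

print(len(default_head), len(last_two), len(all_but_last_three))  # 5 2 7
```

A negative argument is handy when you want to trim a known number of rows from one end rather than count how many to keep.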


3. View summary statistics using the DataFrame `describe()` method.

In [5]:
df.describe()

Unnamed: 0,integers,floats
count,100.0,100.0
mean,8.03,0.520552
std,4.529265,0.284661
min,0.0,0.001942
25%,4.0,0.265119
50%,8.0,0.502868
75%,12.0,0.771623
max,15.0,0.991561


In [6]:
df.min()

color             Blue
integers             0
floats      0.00194239
dtype: object

In [7]:
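# numeric_only=True skips the string column; recent pandas versions raise an error without it
df.std(numeric_only=True)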

integers    4.529265
floats      0.284661
dtype: float64
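
The individual statistics that `describe()` reports are also available as separate methods, on a whole DataFrame or on a single column. The sketch below assumes a small hypothetical frame named `demo` so the values can be checked by hand; it is not the random DataFrame created above.

```python
import pandas as pd

# Hypothetical frame with one string column and one numeric column
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Blue', 'Green'],
                     'integers': [1, 2, 3, 4]})

col_mean = demo['integers'].mean()           # statistic for a single column
numeric_std = demo.std(numeric_only=True)    # only numeric columns are considered
lowest = demo.min()                          # strings are compared alphabetically

print(col_mean)                   # 2.5
print(list(numeric_std.index))    # ['integers']
print(lowest['color'])            # Blue
```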

4. The `describe()` method accepts some optional arguments, including `include` and `exclude`. By default, `describe()` only shows statistics for columns with numerical data, but if you add the argument `include=object`, it will display statistics for columns with string data. (Older tutorials write this as `np.object`, an alias that has been removed from NumPy; the built-in `object` is equivalent.) Try this.

In [8]:
df.describe(include=object)

Unnamed: 0,color
count,100
unique,3
top,Blue
freq,38
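
The `exclude` argument works as the mirror image of `include`. As a rough sketch, assuming a small hypothetical frame named `demo`, the example below uses `np.number` to split the summary into numeric and non-numeric columns.

```python
import pandas as pd
import numpy as np

# Hypothetical frame mixing string and numeric columns
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Blue', 'Green'],
                     'integers': [1, 2, 3, 4],
                     'floats': [0.1, 0.2, 0.3, 0.4]})

numeric_stats = demo.describe(include=[np.number])  # numeric columns only
text_stats = demo.describe(exclude=[np.number])     # everything that is not numeric

print(list(numeric_stats.columns))     # ['integers', 'floats']
print(text_stats.loc['top', 'color'])  # Blue
```

For string columns the summary rows are `count`, `unique`, `top` (the most frequent value), and `freq` (how often it occurs).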


5. If you change the argument to `include='all'`, it will display statistics for all columns in the DataFrame, inserting `NaN` (not a number) when a statistic does not apply to a column's data type. Try viewing statistics for all columns using `describe()`.

In [9]:
df.describe(include='all')

Unnamed: 0,color,integers,floats
count,100,100.0,100.0
unique,3,,
top,Blue,,
freq,38,,
mean,,8.03,0.520552
std,,4.529265,0.284661
min,,0.0,0.001942
25%,,4.0,0.265119
50%,,8.0,0.502868
75%,,12.0,0.771623
max,,15.0,0.991561
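
To check where the `NaN` placeholders land, you can look up individual cells of the summary. This is a minimal sketch on a hypothetical frame named `demo`, not the DataFrame used in the cells above.

```python
import pandas as pd

# Hypothetical frame with one string column and one numeric column
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Blue', 'Green'],
                     'integers': [1, 2, 3, 4]})

summary = demo.describe(include='all')

# 'mean' has no meaning for strings, and 'top' is not reported for numbers
mean_of_color = summary.loc['mean', 'color']
top_of_integers = summary.loc['top', 'integers']

print(pd.isna(mean_of_color), pd.isna(top_of_integers))  # True True
```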


## Selecting Data
6. You can select a column using bracket syntax very similar to that used with dictionaries. Put the column name, as a string, in brackets after the DataFrame name. Try this with the column 'color'.

In [14]:
df['color']

0     Green
1      Blue
2     Green
3     Green
4       Red
      ...  
95     Blue
96    Green
97      Red
98      Red
99     Blue
Name: color, Length: 100, dtype: object

7. Try selecting the columns 'color' and 'floats' by supplying them as a list of strings in the same bracket syntax.

In [17]:
df[['color', 'floats']]

Unnamed: 0,color,floats
0,Green,0.599703
1,Blue,0.265175
2,Green,0.991561
3,Green,0.249424
4,Red,0.880343
...,...,...
95,Blue,0.946903
96,Green,0.839961
97,Red,0.221419
98,Red,0.761552
99,Blue,0.708124
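
The two bracket forms return different types: a single string gives a Series, while a list of strings gives a DataFrame, even when the list holds only one name. The sketch below illustrates this on a hypothetical frame named `demo`.

```python
import pandas as pd

# Hypothetical frame, used only to show the return types
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Blue', 'Green'],
                     'integers': [1, 2, 3, 4],
                     'floats': [0.1, 0.2, 0.3, 0.4]})

one_col = demo['color']                # string key: a Series
one_col_frame = demo[['color']]        # one-item list: a DataFrame with one column
two_cols = demo[['color', 'floats']]   # list of names: a DataFrame with those columns

print(type(one_col).__name__)          # Series
print(type(one_col_frame).__name__)    # DataFrame
print(two_cols.shape)                  # (4, 2)
```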


8. The bracket syntax in DataFrames is overloaded to select rows as well. Selecting rows uses the syntax we used to select slices in sequences: a start number, a colon, and an upper bound number (which is not included). Try selecting three rows from the DataFrame using the slice `10:13`.

In [19]:
df[10:13]

Unnamed: 0,color,integers,floats
10,Blue,0,0.809239
11,Blue,5,0.851702
12,Red,9,0.165462
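
An integer slice in plain brackets selects rows by position, which is the same selection `.iloc[]` makes explicitly. As a small sketch, the hypothetical frame `demo` below uses letter labels so that positions and labels cannot be confused.

```python
import pandas as pd

# Hypothetical frame with letter labels instead of the default 0..n-1 index
demo = pd.DataFrame({'n': [10, 20, 30, 40, 50]}, index=list('abcde'))

by_bracket = demo[1:3]     # rows at positions 1 and 2; the upper bound is excluded
by_iloc = demo.iloc[1:3]   # the same rows, selected explicitly by position

print(by_bracket.index.tolist())    # ['b', 'c']
print(by_bracket.equals(by_iloc))   # True
```

Writing `.iloc[]` makes the intent clearer to a reader, since plain brackets with a string or list select columns instead.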


9. Now let's try the `.loc[]` syntax. It also uses bracket syntax, but in this case you will specify both rows and columns to select. Select all of the rows by supplying a lone colon as the first argument, and the column 'color' by supplying it as the second argument (remember that arguments must be separated by a comma).

In [28]:
df.loc[:,'color']

0     Green
1      Blue
2     Green
3     Green
4       Red
      ...  
95     Blue
96    Green
97      Red
98      Red
99     Blue
Name: color, Length: 100, dtype: object

10. Now specify a slice, `10:13`, for the first argument and a list of columns, `['color', 'integers']`, as a second, to select **four** rows (the upper bound in `loc[]` is included) and two columns.

In [29]:
df.loc[10:13, ['color', 'integers']]

    color  integers
10   Blue         0
11   Blue         5
12    Red         9
13  Green         1
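
The difference in how the upper bound is treated can be seen by running the same slice through `.loc[]` and `.iloc[]`. This is a minimal sketch on a hypothetical six-row frame named `demo` with the default integer index, where labels and positions happen to coincide.

```python
import pandas as pd

# Hypothetical frame with the default index 0..5
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green'],
                     'integers': [5, 3, 8, 1, 9, 2]})

by_label = demo.loc[2:4, ['color', 'integers']]   # labels 2, 3 and 4: upper bound included
by_position = demo.iloc[2:4, [0, 1]]              # positions 2 and 3: upper bound excluded

print(len(by_label), len(by_position))   # 3 2
```

Both the row selector and the column selector go inside the same pair of brackets, separated by a comma. Closing the bracket before the comma builds a tuple of two separate objects rather than a selection.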