# Examining Data Characteristics and Summary Statistics

It is good practice to understand your data and gather as many insights from it as you can before doing any analysis. For example, you want to understand the **characteristics** of your dataset. That is, you want to know:
- what columns are reported in your data
- what the data types are for each column
- what the first few observations look like

Let's revisit the sample dataset of 1985 model-year imported cars, which describes each automobile in terms of its various specifications.  This data is stored as a CSV file on the Math@Work server.

In [9]:
import pandas as pd
autos = pd.read_csv('https://mathatwork.org/DATA/automobiles.csv')
print(autos.head(3))
print(autos.tail(3))
print(autos.dtypes)

          make fuel-type aspiration num-of-doors   body-style drive-wheels  \
0  alfa-romero       gas        std          two  convertible          rwd   
1  alfa-romero       gas        std          two  convertible          rwd   
2  alfa-romero       gas        std          two    hatchback          rwd   

  engine-location  wheel-base  length  width  height  curb-weight engine-type  \
0           front        88.6   168.8   64.1    48.8         2548        dohc   
1           front        88.6   168.8   64.1    48.8         2548        dohc   
2           front        94.5   171.2   65.5    52.4         2823        ohcv   

  num-of-cylinders  engine-size fuel-system  compression-ratio  city-mpg  \
0             four          130        mpfi                9.0        21   
1             four          130        mpfi                9.0        21   
2              six          152        mpfi                9.0        19   

   highway-mpg    price  
0           27  13495.0  
1    

Upon reviewing the dataset head, tail, and column data types, you can see that the data looks reasonably good.  At this point, there is no indication of inappropriate column data types.  The data types are consistent with the data that is reported in each respective column.
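The same review can be done programmatically by splitting the columns by data type. The following is a minimal sketch, assuming a small DataFrame named `sample` (a hypothetical name) built from a few of the values shown in the `head` output above, so that it runs without downloading the full file.

```python
import pandas as pd

# Three rows copied from the head() output above (a subset of the columns only).
sample = pd.DataFrame({
    "make": ["alfa-romero", "alfa-romero", "alfa-romero"],
    "num-of-doors": ["two", "two", "two"],
    "wheel-base": [88.6, 88.6, 94.5],
    "curb-weight": [2548, 2548, 2823],
    "city-mpg": [21, 21, 19],
})

# Split the columns by data type to confirm each group holds what you expect.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(sample.shape)    # (number of rows, number of columns)
print(numeric_cols)    # columns stored as numbers
print(text_cols)       # columns stored as text
```

A measurement such as *wheel-base* showing up in the text group would be a sign that the column contains entries pandas could not read as numbers.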

Another good way to understand your data is by calculating various **summary statistics**.  Summary statistics can give you very useful information like where your data is centered and how spread out it is.  Now, take a look at the various summary statistics of the automobiles dataset.

In [10]:
summary = autos.describe()
print(summary)

       wheel-base      length       width      height  curb-weight  \
count  205.000000  205.000000  205.000000  205.000000   205.000000   
mean    98.756585  174.049268   65.907805   53.724878  2555.565854   
std      6.021776   12.337289    2.145204    2.443522   520.680204   
min     86.600000  141.100000   60.300000   47.800000  1488.000000   
25%     94.500000  166.300000   64.100000   52.000000  2145.000000   
50%     97.000000  173.200000   65.500000   54.100000  2414.000000   
75%    102.400000  183.100000   66.900000   55.500000  2935.000000   
max    120.900000  208.100000   72.300000   59.800000  4066.000000   

       engine-size  compression-ratio    city-mpg  highway-mpg         price  
count   205.000000         205.000000  205.000000   205.000000    201.000000  
mean    126.907317          10.142537   25.219512    30.751220  13207.129353  
std      41.642693           3.972040    6.542142     6.886443   7947.066342  
min      61.000000           7.000000   13.000000    

Pandas **.describe( )** generates descriptive statistics for the numeric columns that summarize the central tendency (mean, and the median reported as 50%), dispersion (std, quartiles, min and max) and shape of the dataset's distribution, excluding NaN values.  Review the Python for Data Science workshop for additional details on these statistical properties.
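To see what "excluding NaN values" means in practice, here is a small sketch that uses made-up numbers (not taken from the automobiles dataset). The *count* row reports only the non-missing entries, and the other statistics are computed from those entries alone.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to illustrate the behavior.
prices = pd.Series([10.0, 20.0, np.nan, 30.0], name="price")

stats = prices.describe()
print(stats)

# The Series has 4 entries, but count only includes the 3 non-missing ones.
print(len(prices), stats["count"])
```

Comparing *count* across columns is therefore a quick way to spot a column that has missing values.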

Notice that the summary statistics were saved into a DataFrame named *summary*. We can now loop through the columns of this DataFrame to calculate the standard deviation of each column as a percentage of its mean (its relative magnitude).

In [18]:
for y in summary:
    rel_mag = round(summary[y]['std']/summary[y]['mean']*100,0)
    print(y,':  is',rel_mag,'% of the mean')

wheel-base :  is 6.0 % of the mean
length :  is 7.0 % of the mean
width :  is 3.0 % of the mean
height :  is 5.0 % of the mean
curb-weight :  is 20.0 % of the mean
engine-size :  is 33.0 % of the mean
compression-ratio :  is 39.0 % of the mean
city-mpg :  is 26.0 % of the mean
highway-mpg :  is 22.0 % of the mean
price :  is 60.0 % of the mean


This indicates that *wheel-base*, *length*, *width*, *height*, *curb-weight*, *engine-size*, *city-mpg* and *highway-mpg* all have fairly low variability (or spread) relative to their means, with standard deviations between 3% and 33% of the mean, while *compression-ratio* and *price* have the highest variability, at 39% and 60% of the mean.
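The standard deviation expressed as a percentage of the mean is commonly called the coefficient of variation. The loop above can also be written as a single vectorized expression. This is a sketch only: `summary_sample` is a hypothetical name, and it holds just the *mean* and *std* rows for four columns, copied from the `describe` output above.

```python
import pandas as pd

# mean and std values copied from the describe() output above (four columns only).
summary_sample = pd.DataFrame(
    {
        "wheel-base": [98.756585, 6.021776],
        "width": [65.907805, 2.145204],
        "engine-size": [126.907317, 41.642693],
        "price": [13207.129353, 7947.066342],
    },
    index=["mean", "std"],
)

# Standard deviation as a percentage of the mean, for all columns at once.
rel_mag = (summary_sample.loc["std"] / summary_sample.loc["mean"] * 100).round(0)

# Sorting makes it easy to see which columns vary the least and the most.
print(rel_mag.sort_values())
```

With the full *summary* DataFrame, the same expression would use `summary.loc["std"]` and `summary.loc["mean"]`.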
<br><br>
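Since there is no indication of inappropriate column data types, and working on the assumption that the data has been preprocessed to include no missing values or bad data entries (done in the previous lessons), we can conclude that at this point the data is of good quality for analyses. Note, however, that *price* reports a count of 201 rather than 205 in the summary statistics, which means this column still contains four missing values in the copy of the data loaded here.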
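The no-missing-values assumption can be checked directly rather than taken on faith. Below is a minimal sketch on a small hypothetical DataFrame (the rows are made up for illustration). The same two calls would work on *autos*.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the NaN stands in for a missing price.
sample = pd.DataFrame({
    "make": ["alfa-romero", "audi", "bmw"],
    "city-mpg": [21, 24, 23],
    "price": [13495.0, np.nan, 16430.0],
})

# Number of missing values in each column.
missing = sample.isna().sum()
print(missing)

# Names of the columns that contain at least one missing value.
incomplete = missing[missing > 0].index.tolist()
print(incomplete)
```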

### Exercise

Recall the sample dataset that contains 150 observations of iris plants and their various characteristics.  This data is stored as a CSV file on the Math@Work server.

In [19]:
iris = pd.read_csv('https://mathatwork.org/DATA/iris.csv')

**1)** In the next cell, examine the characteristics of your dataset.

**2)** In the next cell, examine summary statistics of your dataset. Save your summary statistics into a DataFrame named *summary*.

Run the following code to loop through the *summary* DataFrame to calculate the relative magnitude of the standard deviation for each data column.  In the cell below the output, discuss the spread of your data.

In [None]:
for y in summary:
    rel_mag = round(summary[y]['std']/summary[y]['mean']*100,0)
    print(y,':  is',rel_mag,'% of the mean')

**3)** Working on the assumption that the data has been preprocessed to include no missing values or bad data entries, can you conclude that the data is of good quality for analyses? Explain.