## Airfoil self-noise example data set

We downloaded the data set as a `.dat` file from the UCI Machine Learning Repository.

## Import modules

In [1]:
import pandas as pd

## Read in the data

First, read in the data set from the local file.

In [2]:
file_name = 'airfoil_self_noise_from_uci.dat'

In [3]:
%whos

Variable    Type      Data/Info
-------------------------------
file_name   str       airfoil_self_noise_from_uci.dat
pd          module    <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>


In [4]:
a_file = 'airfoil_noise_example.ipynb'

In [5]:
df_1 = pd.read_csv( file_name, header=None, sep='\t')

In [6]:
%whos

Variable    Type         Data/Info
----------------------------------
a_file      str          airfoil_noise_example.ipynb
df_1        DataFrame             0     1       2 <...>\n[1503 rows x 6 columns]
file_name   str          airfoil_self_noise_from_uci.dat
pd          module       <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>


In [7]:
print( df_1.shape )

(1503, 6)


In [9]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1503 entries, 0 to 1502
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       1503 non-null   int64  
 1   1       1503 non-null   float64
 2   2       1503 non-null   float64
 3   3       1503 non-null   float64
 4   4       1503 non-null   float64
 5   5       1503 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 70.6 KB


In [10]:
df_1.head()

      0    1       2     3         4        5
0   800  0.0  0.3048  71.3  0.002663  126.201
1  1000  0.0  0.3048  71.3  0.002663  125.201
2  1250  0.0  0.3048  71.3  0.002663  125.951
3  1600  0.0  0.3048  71.3  0.002663  127.591
4  2000  0.0  0.3048  71.3  0.002663  127.461


Now read in the data set again, but this time specify the variable names. This time we also read the data directly from the UCI website rather than from the local file.

In [11]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat'

Define a list that stores the variable names.

In [12]:
var_names = ['frequency', 'aoa', 'chord', 'velocity', 'displacement', 'decibels']

In [13]:
%whos

Variable    Type         Data/Info
----------------------------------
a_file      str          airfoil_noise_example.ipynb
data_url    str          https://archive.ics.uci.e<...>91/airfoil_self_noise.dat
df_1        DataFrame             0     1       2 <...>\n[1503 rows x 6 columns]
file_name   str          airfoil_self_noise_from_uci.dat
pd          module       <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>
var_names   list         n=6


In [14]:
airfoil_df = pd.read_csv( data_url, header=None, names=var_names, sep='\t' )

In [15]:
print( airfoil_df.shape )

(1503, 6)


In [16]:
airfoil_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1503 entries, 0 to 1502
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   frequency     1503 non-null   int64  
 1   aoa           1503 non-null   float64
 2   chord         1503 non-null   float64
 3   velocity      1503 non-null   float64
 4   displacement  1503 non-null   float64
 5   decibels      1503 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 70.6 KB


In [17]:
airfoil_df.head()

   frequency  aoa   chord  velocity  displacement  decibels
0        800  0.0  0.3048      71.3      0.002663   126.201
1       1000  0.0  0.3048      71.3      0.002663   125.201
2       1250  0.0  0.3048      71.3      0.002663   125.951
3       1600  0.0  0.3048      71.3      0.002663   127.591
4       2000  0.0  0.3048      71.3      0.002663   127.461
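
The effect of the `names` argument can be checked without the file or a network connection. The sketch below is only an illustration: `sample_text` is a hypothetical two-line stand-in typed from the first two rows shown above (the exact number formatting in the raw file is an assumption), and `col_names`, `no_names_df` and `named_df` are names introduced here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw file: two tab-separated rows, no header line.
# The values come from the head() output; the raw formatting is an assumption.
sample_text = (
    "800\t0.0\t0.3048\t71.3\t0.002663\t126.201\n"
    "1000\t0.0\t0.3048\t71.3\t0.002663\t125.201\n"
)

col_names = ['frequency', 'aoa', 'chord', 'velocity', 'displacement', 'decibels']

# Without names, pandas labels the columns with integer positions.
no_names_df = pd.read_csv(io.StringIO(sample_text), header=None, sep='\t')

# With names, the same data gets descriptive column labels.
named_df = pd.read_csv(io.StringIO(sample_text), header=None, names=col_names, sep='\t')

print(list(no_names_df.columns))
print(list(named_df.columns))
```

Keeping `header=None` matters here: without it, pandas would treat the first data row as a header line when `names` is not given.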


## Number of unique values

First, double check the number of missing observations for each column (variable).

In [18]:
print( airfoil_df.isna().sum() )

frequency       0
aoa             0
chord           0
velocity        0
displacement    0
decibels        0
dtype: int64
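
Every count above is zero, so it may help to see what the same check reports when a value is missing. This is a minimal sketch on a made-up frame; `toy_df` and `missing_counts` are hypothetical names and the values are not taken from the airfoil data.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing velocity value.
toy_df = pd.DataFrame({
    'velocity': [71.3, np.nan, 39.6],
    'chord': [0.3048, 0.3048, 0.3048],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```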


But how many unique values does each variable have?

In [19]:
airfoil_df.nunique()

frequency         21
aoa               27
chord              6
velocity           4
displacement     105
decibels        1456
dtype: int64

In [20]:
airfoil_df.velocity.value_counts()

39.6    480
71.3    465
31.7    281
55.5    277
Name: velocity, dtype: int64
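
The two results above are related: `value_counts()` returns one entry per unique value, and its counts add up to the number of rows (here 480 + 465 + 281 + 277 = 1503). A small sketch on made-up values, with `toy_df` and `velocity_counts` as hypothetical names:

```python
import pandas as pd

# Made-up velocities, not the airfoil data.
toy_df = pd.DataFrame({'velocity': [71.3, 71.3, 39.6, 31.7, 39.6, 71.3]})

velocity_counts = toy_df['velocity'].value_counts()
print(velocity_counts)

# One entry per unique value, and the counts account for every row.
print(len(velocity_counts) == toy_df['velocity'].nunique())
print(velocity_counts.sum() == len(toy_df))
```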

## Reshaping from wide to long format

In [21]:
airfoil_lf = airfoil_df.reset_index().melt( id_vars = ['index'], value_vars = var_names )

In [22]:
airfoil_lf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9018 entries, 0 to 9017
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   index     9018 non-null   int64  
 1   variable  9018 non-null   object 
 2   value     9018 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 211.5+ KB


In [23]:
airfoil_lf.head()

   index   variable   value
0      0  frequency   800.0
1      1  frequency  1000.0
2      2  frequency  1250.0
3      3  frequency  1600.0
4      4  frequency  2000.0


In [24]:
airfoil_lf.nunique()

index       1503
variable       6
value       1619
dtype: int64
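
The long-format frame has one row per original row and variable, which is why it has 1503 × 6 = 9018 rows. The sketch below repeats the reshape on a made-up two-row frame and then pivots back to wide format as a sanity check; `wide_df`, `long_df` and `round_trip` are hypothetical names, and the `pivot` step is an addition that is not part of the analysis above.

```python
import pandas as pd

# Made-up wide frame: 2 rows, 3 variables.
wide_df = pd.DataFrame({
    'frequency': [800, 1000],
    'velocity': [71.3, 71.3],
    'decibels': [126.201, 125.201],
})

# reset_index() turns the row labels into an 'index' column that identifies
# which original row each long-format value came from.
long_df = wide_df.reset_index().melt(
    id_vars=['index'],
    value_vars=['frequency', 'velocity', 'decibels'],
)
print(long_df)  # 2 rows x 3 variables = 6 rows

# Pivoting on the same identifier recovers one column per variable.
round_trip = long_df.pivot(index='index', columns='variable', values='value')
print(round_trip)
```

Because all variables share a single `value` column after melting, the integer `frequency` values come back as floats.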

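Next, group the long-format data by variable and count the rows in each group. Each variable should have 1503 rows, one per original observation.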

In [25]:
airfoil_lf.groupby(['variable']).size().reset_index(name='num_rows')

       variable  num_rows
0           aoa      1503
1         chord      1503
2      decibels      1503
3  displacement      1503
4     frequency      1503
5      velocity      1503
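
The same grouping can carry more than one summary per variable. As a sketch only, the example below uses named aggregation on a made-up long-format frame to report the row count and the number of unique values side by side; `long_df` and `summary_df` are hypothetical names and the values are not the airfoil data.

```python
import pandas as pd

# Made-up long-format frame with two variables and two rows each.
long_df = pd.DataFrame({
    'index': [0, 1, 0, 1],
    'variable': ['aoa', 'aoa', 'chord', 'chord'],
    'value': [0.0, 1.5, 0.3048, 0.3048],
})

# Named aggregation: each keyword becomes an output column,
# defined as (input column, aggregation function).
summary_df = (
    long_df
    .groupby('variable')
    .agg(num_rows=('value', 'size'), num_unique=('value', 'nunique'))
    .reset_index()
)
print(summary_df)
```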
