# Example (synthetic) data sets
The purpose of this effort is to create a schema for weather data for one location, Lamesa, TX, which is mentioned in the Legacy Cotton Dataset.

In this chapter we will inspect the provided dataset and prepare it for conversion into linked data. As described earlier, linked data is data in a linked-data format made up of triples, where each part of a triple is either a value or an identifier (IRI) pointing to a concept definition.
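
To make the triple structure concrete, here is a minimal sketch in plain Python. The namespace `https://example.org/weather/` and every IRI built from it are hypothetical placeholders, not terms from a real vocabulary; the station identifier, date, and TMAX value are taken from the first row of the weather data inspected in this chapter.

```python
from typing import NamedTuple, Union


class Triple(NamedTuple):
    """One linked-data statement: subject, predicate, object."""
    subject: str              # an IRI
    predicate: str            # an IRI
    obj: Union[str, float]    # an IRI or a literal value


# Hypothetical namespace, used only for illustration.
EX = "https://example.org/weather/"
observation = EX + "observation/USC00415013/2010-03-01"

triples = [
    # The object is an IRI pointing to another resource.
    Triple(observation, EX + "station", EX + "station/USC00415013"),
    # The object is a literal value.
    Triple(observation, EX + "TMAX", 60.0),
]


def is_iri(value) -> bool:
    """Rough check: treat http(s) strings as IRIs, everything else as literals."""
    return isinstance(value, str) and value.startswith(("http://", "https://"))


for t in triples:
    kind = "IRI" if is_iri(t.obj) else "literal"
    print(f"{t.subject} {t.predicate} {t.obj!r} ({kind})")
```

In a real conversion the placeholder IRIs would be replaced by identifiers from the controlled vocabularies or ontologies selected during alignment.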

The process of converting data to linked data is fairly straightforward and can be summarized as follows:
1. Inspect the data
2. Clean the data headers
3. Align the data with common controlled vocabularies or ontologies
4. Design linked-data shapes using the IRIs from the controlled vocabularies and ontologies
5. Transform the data into RDF

In this chapter we will focus on the first two steps.
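
As an illustration of step 2, the sketch below normalizes column labels with the standard library only. The naming convention (trimmed, lower-case, underscores) and the example labels are assumptions made for illustration; the project may settle on a different convention.

```python
import re


def clean_header(label: str) -> str:
    """Trim the label, lower-case it, and replace every run of
    non-alphanumeric characters with a single underscore."""
    label = label.strip().lower()
    label = re.sub(r"[^0-9a-z]+", "_", label)
    return label.strip("_")


# Made-up labels showing typical problems: stray spaces, brackets, hyphens.
raw_headers = [" Station ", "Max Temp (F)", "PRCP", "snow-depth"]
cleaned = [clean_header(h) for h in raw_headers]
print(cleaned)
```

Note that a label such as "Max Temp (F)" carries a unit; cleaning the header should not silently discard that information, so it is worth recording it elsewhere before renaming.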

## Dataset on Weather
The first dataset is a CSV file (Dawson3242890.csv).
It contains weather data for Lamesa, TX, for March 1 through October 30, 2010.

In [1]:
import pandas as pd

# Load the weather file; the first column (STATION) becomes the index
weather = pd.read_csv("data/Dawson3242890.csv", index_col=0)
weather

STATION,NAME,DATE,DAPR,MDPR,PRCP,SNOW,SNWD,TMAX,TMIN,TOBS,WT01,WT11
USC00415013,"LAMESA 1 SSE, TX US",2010-03-01,,,0.44,0.0,0.0,60,36.0,36.0,,
USC00415013,"LAMESA 1 SSE, TX US",2010-03-02,,,0.02,0.0,0.0,41,27.0,27.0,,
USC00415013,"LAMESA 1 SSE, TX US",2010-03-03,,,0.00,0.0,0.0,59,27.0,33.0,,
USC00415013,"LAMESA 1 SSE, TX US",2010-03-04,,,0.00,0.0,0.0,68,33.0,35.0,,
USC00415013,"LAMESA 1 SSE, TX US",2010-03-05,,,0.00,0.0,0.0,71,33.0,43.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...
USC00415013,"LAMESA 1 SSE, TX US",2010-10-27,,,0.00,0.0,0.0,74,40.0,47.0,,
USC00415013,"LAMESA 1 SSE, TX US",2010-10-28,,,0.00,0.0,0.0,76,40.0,40.0,,
USC00415013,"LAMESA 1 SSE, TX US",2010-10-29,,,0.00,0.0,0.0,67,32.0,32.0,,
USC00415013,"LAMESA 1 SSE, TX US",2010-10-30,,,0.00,0.0,0.0,73,32.0,45.0,,


In [2]:
weather.info()

<class 'pandas.core.frame.DataFrame'>
Index: 245 entries, USC00415013 to USC00415013
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   NAME    245 non-null    object 
 1   DATE    245 non-null    object 
 2   DAPR    0 non-null      float64
 3   MDPR    0 non-null      float64
 4   PRCP    245 non-null    float64
 5   SNOW    245 non-null    float64
 6   SNWD    245 non-null    float64
 7   TMAX    245 non-null    int64  
 8   TMIN    244 non-null    float64
 9   TOBS    241 non-null    float64
 10  WT01    0 non-null      float64
 11  WT11    0 non-null      float64
dtypes: float64(9), int64(1), object(2)
memory usage: 24.9+ KB


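This dataset contains 12 data columns of various datatypes, plus the STATION identifier that is used as the index. Four of these columns (DAPR, MDPR, WT01 and WT11) contain no values at all, and TMIN and TOBS have a few missing values. In one of the next steps these field names will be aligned with various controlled vocabularies or ontologies.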
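
The inspection can also be done programmatically. The following sketch assumes pandas is available and uses three rows copied from the table above as an inline sample, so it runs without the data file; the variable names are illustrative. It lists the columns that hold no values and parses DATE, which `info()` reports as a generic `object` column.

```python
import io

import pandas as pd

# Three rows copied from the weather table, embedded so the sketch is self-contained.
sample_csv = """\
STATION,NAME,DATE,DAPR,MDPR,PRCP,SNOW,SNWD,TMAX,TMIN,TOBS,WT01,WT11
USC00415013,"LAMESA 1 SSE, TX US",2010-03-01,,,0.44,0.0,0.0,60,36.0,36.0,,
USC00415013,"LAMESA 1 SSE, TX US",2010-03-02,,,0.02,0.0,0.0,41,27.0,27.0,,
USC00415013,"LAMESA 1 SSE, TX US",2010-03-03,,,0.00,0.0,0.0,59,27.0,33.0,,
"""
sample = pd.read_csv(io.StringIO(sample_csv), index_col=0)

# Columns in which every value is missing.
empty_columns = [col for col in sample.columns if sample[col].isna().all()]
print("Columns without any values:", empty_columns)

# Parse the ISO-formatted DATE strings into proper datetime values.
sample["DATE"] = pd.to_datetime(sample["DATE"], format="%Y-%m-%d")
print(sample["DATE"].dtype)
```

Whether empty columns should be dropped or kept as part of the schema is a project decision; the sketch only reports them.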

In this chapter we have reviewed and, where needed, cleaned the source data. In the next chapter these field labels will be used to identify IRIs that unambiguously point to their definitions. At this stage of the project we need to be a bit creative: some crucial information, such as conditionals or units, has been removed from the data, but it is needed in the semantic models that will be derived.

Moving forward, the project should design a common tabular format that, next to field labels, also captures these conditionals, units, and cardinality.
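
One possible shape for such a format is sketched below, purely as a starting point for discussion. The `FieldSpec` class, its column names, and all example values are hypothetical. In particular, the units and conditions shown are assumptions, since the source file does not state them.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class FieldSpec:
    """One row of a hypothetical field-description table."""
    label: str                 # field label as used in the data file
    datatype: str              # expected datatype of the values
    unit: Optional[str]        # unit of measurement, if any
    condition: Optional[str]   # conditional or context of the measurement
    min_count: int             # cardinality: minimum values per record
    max_count: Optional[int]   # cardinality: maximum values per record (None = unbounded)


# Illustrative values only: units and conditions are NOT confirmed by the source data.
specs = [
    FieldSpec("TMAX", "integer", "degF", "daily maximum", 1, 1),
    FieldSpec("PRCP", "decimal", "inch", "daily total", 0, 1),
]

# Serialize the specifications as CSV so they can travel with the data.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(FieldSpec)])
writer.writeheader()
for spec in specs:
    writer.writerow(asdict(spec))
print(buffer.getvalue())
```

Keeping the description in a plain table means data curators can fill it in with a spreadsheet, while the same rows can later drive the design of the linked-data shapes.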

Eventually this book should contain a chapter that describes this tabular format. 

Eventually the steps described in this chapter will become redundant. Moving forward, performers in the different PDAs will ideally build on a predefined tabular format in which the field names are selected from the provided codebook.

## NALT shapes and codebook
The DT codebook will be a list of selected field names. Data curators will be able to select field names from this codebook and to request field names that do not yet exist.
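
A minimal sketch of how a dataset's columns could be checked against such a codebook is shown below. The codebook contents and the function name `check_against_codebook` are hypothetical, since the actual codebook is not yet defined; only the column names come from the weather dataset.

```python
# Hypothetical codebook: the real list of field names is still to be defined.
codebook = {"STATION", "NAME", "DATE", "PRCP", "TMAX", "TMIN"}


def check_against_codebook(columns, codebook):
    """Split column names into those found in the codebook and those
    that would need to be requested as new field names."""
    known = [col for col in columns if col in codebook]
    to_request = [col for col in columns if col not in codebook]
    return known, to_request


# A subset of the column names from the weather dataset.
columns = ["STATION", "NAME", "DATE", "PRCP", "SNOW", "TMAX", "TMIN", "TOBS"]
known, to_request = check_against_codebook(columns, codebook)
print("In codebook:", known)
print("To request:", to_request)
```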
