# Normalizing Log ASCII Standard (LAS) Files: A Walkthrough

This walkthrough normalizes Log ASCII Standard (LAS) files, a well-log format widely used in the oil and gas industry. LAS files are hard to work with in bulk because naming conventions are inconsistent and units vary from file to file. This notebook is Part 1 of our data science approach to handling these complexities.


## Step 1: Setting up the environment

We start by making sure the requests library is installed; it lets us send the HTTP requests that download the required files. The install command below is commented out, so uncomment it if requests is missing from your environment.

In [1]:
#! pip install requests

The next cell imports the libraries used in this notebook:

    pandas -> for data manipulation and analysis,
    requests -> for sending HTTP requests,
    zipfile -> for extracting ZIP files, and
    io -> for handling in-memory file streams.

In [2]:
import pandas as pd
import requests
import zipfile
import io

## Step 2: Downloading the LAS files

The redownload flag controls whether the dataset is downloaded again. Leaving it set to False once the files are on disk limits data usage and avoids unnecessary downloads.

In [3]:
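# Set to True to download and extract the dataset again.
redownload = False

if redownload:
    url = "https://www.kgs.ku.edu/PRS/Ora_Archive/ks_las_files.zip"
    response = requests.get(url)
    response.raise_for_status()  # stop on an HTTP error instead of failing inside zipfile
    zip_file = io.BytesIO(response.content)

    # Extract the archive into the current working directory.
    with zipfile.ZipFile(zip_file) as z:
        z.extractall()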
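
As an alternative to toggling the flag by hand, the decision can be derived from what is already on disk. The sketch below is a minimal illustration and not part of the original notebook: the helper name `should_download` and its `force` parameter are hypothetical, and it assumes the archive extracts to `ks_las_files.txt` in the working directory, as the loading step suggests.

```python
from pathlib import Path


def should_download(target, force=False):
    """Return True when `target` is missing or a fresh download is forced.

    Hypothetical helper, not part of the original notebook.
    """
    return force or not Path(target).exists()


# Assumption: the archive extracts to this file in the working directory.
redownload = should_download("ks_las_files.txt")
```

With this in place, the first run downloads the data and later runs skip the download unless `force=True` is passed.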

## Step 3: Loading the data

In [4]:
las_file_df = pd.read_csv('ks_las_files.txt')


In [5]:
las_file_df.head()

Unnamed: 0,KGS_ID,Latitude,Longitude,Location,Operator,Lease,API,Elevation,Elev_Ref,Depth_start,Depth_stop,URL
0,1028187622,39.98344,-97.199375,"T1S R2E, Sec. 10, NW SW NW",KANSAS GEOLOGICAL SURVEY,W. GAYDUSEK II 1,,1599.0,KB,50.5,525.0,http://www.kgs.ku.edu/WellLogs/01S02E/10200690...
1,1044172351,39.943543,-95.936294,"T1S R13E, Sec. 23, W2 SE SW SW",Kinney Oil Company,Baumgartner 1-23 1,15-131-20234,1215.0,KB,1814.2,2096.2,http://www.kgs.ku.edu/WellLogs/kcc_logs_2015/1...
2,1044172351,39.943543,-95.936294,"T1S R13E, Sec. 23, W2 SE SW SW",Kinney Oil Company,Baumgartner 1-23 1,15-131-20234,1215.0,KB,240.0,2096.4,http://www.kgs.ku.edu/WellLogs/kcc_logs_2015/1...
3,1044022696,39.992205,-95.799079,"T1S R14E, Sec. 1, SW NE NE SW",Wolf Operating LLC,Stalder-Adams 1-1,15-131-20225,1130.0,KB,185.2,3663.2,http://www.kgs.ku.edu/WellLogs/kcc_logs_2014/1...
4,1044022696,39.992205,-95.799079,"T1S R14E, Sec. 1, SW NE NE SW",Wolf Operating LLC,Stalder-Adams 1-1,15-131-20225,1130.0,KB,3319.6,3663.6,http://www.kgs.ku.edu/WellLogs/kcc_logs_2014/1...
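
The `Unnamed: 0` column in the output above is what pandas produces when the first header field of a CSV is empty, which usually means a saved index. A minimal sketch, assuming that is the case for `ks_las_files.txt`; the inline CSV is a made-up fragment that reuses two KGS_ID values from the output:

```python
import io

import pandas as pd

# Made-up fragment: the header starts with an empty field, like a saved index.
csv_text = ",KGS_ID,Latitude\n0,1028187622,39.98344\n1,1044172351,39.943543\n"

# Default behaviour: the unnamed first column becomes "Unnamed: 0".
default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default.columns))
print(list(indexed.columns))
```

Whether to pass `index_col=0` here is a judgment call; the rest of this notebook does not depend on the `Unnamed: 0` column either way.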


## Step 4: Normalizing the List

1. We split the 'Location' column of the DataFrame on spaces into individual components.

2. We then remove the commas from those components and combine the 'Township' and 'Range' columns into a new 'Township-Range' column.

3. Finally, we merge the cleaned columns back into the original DataFrame by index and drop the original 'Location' column.

In [6]:
# Split each Location string on single spaces; every token becomes its own column.
location_df = las_file_df.Location.str.split(' ', expand=True)
# Columns named '' are placeholders that are discarded below.
location_df.columns = ['Township', 'Range', '', 'Section', '',  "QC4", "QC3", "QC2", "QC1", '']
columns_to_keep = ['Township', 'Range', 'Section', "QC4", "QC3", "QC2", "QC1"]
location_df = location_df[columns_to_keep]
# Remove the commas left over from the original string.
location_df = location_df.replace(',', '', regex=True)
location_df['Township-Range'] = location_df['Township'] + '-' + location_df['Range']
# Join the parsed columns back onto the original rows by index.
las_file_df = las_file_df.merge(location_df, left_index=True, right_index=True)
las_file_df = las_file_df.drop('Location', axis='columns')
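
A regular expression is another way to pull the same fields out of 'Location'. The sketch below is an alternative technique, not the approach used in this notebook, and the pattern is an assumption based only on the layout of the rows shown in the `head()` output. Rows that do not match the pattern come back as NaN, which makes irregular entries easy to spot.

```python
import pandas as pd

# Two Location strings copied from the head() output above.
sample = pd.DataFrame(
    {"Location": ["T1S R2E, Sec. 10, NW SW NW", "T1S R13E, Sec. 23, W2 SE SW SW"]}
)

# Assumed layout: "T<n><N|S> R<n><E|W>, Sec. <n>, <quarter calls>".
pattern = (
    r"^(?P<Township>T\d+[NS])\s+"
    r"(?P<Range>R\d+[EW]),\s*"
    r"Sec\.\s*(?P<Section>\d+),?\s*"
    r"(?P<Quarters>.*)$"
)

# Named groups become column names in the resulting DataFrame.
parsed = sample["Location"].str.extract(pattern)
parsed["Township-Range"] = parsed["Township"] + "-" + parsed["Range"]
print(parsed)
```

Here the quarter calls are kept as a single 'Quarters' string rather than split into QC1 to QC4, since the number of calls differs between rows.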


## Step 5: Focusing on a specific Township and Range

Due to the large size of the dataset, we filter it down to a specific 'Township-Range' for a manageable and focused analysis.



In [7]:
location_df['Township-Range'].value_counts()

T20S-R20E    267
T32S-R12W    231
T31S-R1W     197
T14S-R32W    160
T11S-R17W    160
            ... 
T13S-R10W      1
T13S-R9W       1
T12S-R41W      1
T12S-R40W      1
T15S-R39W      1
Name: Township-Range, Length: 1618, dtype: int64

In [8]:
tr_mask_value = location_df['Township-Range'].value_counts().idxmax()
tr_mask_value

'T20S-R20E'

In [9]:
mask = las_file_df['Township-Range'] == tr_mask_value

In [10]:
target_df = las_file_df[mask]
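
The same count-then-filter pattern can be tried on a toy DataFrame. This is a minimal sketch with made-up rows; only the column names and the 'T20S-R20E' and 'T1S-R2E' labels come from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "KGS_ID": [1, 2, 3, 4],
        "Township-Range": ["T20S-R20E", "T1S-R2E", "T20S-R20E", "T20S-R20E"],
    }
)

# value_counts() tallies each label; idxmax() returns the label with the highest count.
most_common = toy["Township-Range"].value_counts().idxmax()

# A boolean mask keeps only the matching rows; .copy() gives an independent DataFrame.
subset = toy[toy["Township-Range"] == most_common].copy()
print(most_common, len(subset))
```

Taking a `.copy()` is optional, but it avoids pandas warnings if the filtered frame is modified later.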

## Step 6: Downloading the filtered data

We define a function to download a ZIP file from a specified URL and extract it into a specified location.

If the redownload flag is set to True, we loop through each row of the filtered DataFrame, construct the URL and filename for the ZIP file, and then download and extract it.

In [11]:
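def get_zip(url, filename):
    """Download the ZIP archive at `url` and extract it into the directory `filename`."""
    response = requests.get(url)
    response.raise_for_status()  # fail on HTTP errors rather than on a bad archive
    zip_file = io.BytesIO(response.content)

    with zipfile.ZipFile(zip_file) as z:
        z.extractall(filename)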
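
The extraction step can be exercised without any network access by building a small archive in memory. The sketch below is illustrative only: `extract_zip_bytes` is a hypothetical helper, and the archive content is a placeholder rather than a real well log.

```python
import io
import os
import tempfile
import zipfile


def extract_zip_bytes(content, dest):
    """Extract a ZIP archive held in memory into `dest` and return its member names."""
    with zipfile.ZipFile(io.BytesIO(content)) as z:
        z.extractall(dest)
        return z.namelist()


# Build a tiny archive in memory to stand in for a downloaded response body.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("example.las", "~Version Information\n")  # placeholder content

dest = tempfile.mkdtemp()
names = extract_zip_bytes(buffer.getvalue(), dest)
print(names, os.listdir(dest))
```

Separating the extraction from the HTTP request in this way makes the file handling testable on its own.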

In [12]:
if redownload:
    for idx, row in target_df.iterrows():
        folder = row.KGS_ID
        url = row.URL
        # Archive name without its '.zip' extension (str.removesuffix needs Python 3.9+).
        zip_num = url.split('/')[-1].removesuffix('.zip')
        filename = f'logs/{folder}/{zip_num}/'
        get_zip(url, filename)
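
The folder name for each archive can also be derived with the standard library's URL and path tools. This is a sketch under stated assumptions: `archive_stem` and `target_folder` are hypothetical helpers, and the URL is invented because the real ones are truncated in the output shown earlier.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def archive_stem(url):
    """Return the last path component of `url` without its file extension."""
    return PurePosixPath(urlparse(url).path).stem


def target_folder(kgs_id, url, root="logs"):
    """Build the extraction folder in the logs/<KGS_ID>/<archive>/ layout."""
    return f"{root}/{kgs_id}/{archive_stem(url)}/"


# Hypothetical URL for illustration only.
example_url = "http://example.com/WellLogs/zip_logs.zip"
print(target_folder(1028187622, example_url))
```

Note that `str.strip` is not a substitute for this: it treats its argument as a set of characters to remove from both ends, so `"zip_logs.zip".strip(".zip")` yields `_logs` rather than `zip_logs`.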


This concludes Part 1 of our walkthrough, in which we loaded, normalized, and filtered the data.

In subsequent parts, we will continue with depth adjustment, splicing, and outputting the logs in a format ready for geological workstation software.