The goal of the project is to use PySpark to train a model, on as large a dataset as possible, that labels log entries based on their text.

In [None]:
import re

In [None]:
# pip install numpy scipy


In [None]:
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import (
    col, concat_ws, explode, lit, lower, monotonically_increasing_id,
    regexp_extract, regexp_replace, split, to_timestamp,
)
from pyspark.sql.types import (
    ArrayType, DoubleType, LongType, StringType, StructField, StructType,
    TimestampType,
)
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator, MulticlassClassificationEvaluator,
)
from pyspark.ml.feature import (
    HashingTF, IDF, IndexToString, StopWordsRemover, StringIndexer, Tokenizer,
)
from pyspark.ml.linalg import VectorUDT


In [None]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Model

Loading the data and storing it in a table. Using a regular expression, each line of text is split and written to the appropriate columns.

In [None]:
spark = SparkSession.builder \
    .appName("Log Classification") \
    .config("spark.driver.memory", "14g") \
    .config("spark.executor.memory", "14g") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/28 14:20:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [None]:
logs_rdd = spark.read.text("issue_1/applogcat.log")
log_pattern = r"(\d{1,2}-\d{1,2}) (\d{2}:\d{2}:\d{2}\.\d{3})\s+(\d+)\s+(\d+)\s+([A-Z])\s+([a-zA-Z0-9_]+):\s+(.*)"
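To see what the pattern captures, it can be tried outside Spark with Python's `re` module. This is only a minimal sketch: the two sample lines are invented for illustration, `parse_line` is a hypothetical helper, and Python's regex engine stands in for the Java engine Spark uses (the constructs in this pattern behave the same in both). The helper returns empty strings for a non-matching line, which mirrors how `regexp_extract` behaves.

```python
import re

# Same pattern as in the notebook cell above.
log_pattern = r"(\d{1,2}-\d{1,2}) (\d{2}:\d{2}:\d{2}\.\d{3})\s+(\d+)\s+(\d+)\s+([A-Z])\s+([a-zA-Z0-9_]+):\s+(.*)"
FIELDS = ["date", "time", "pid", "tid", "level", "source", "message"]

def parse_line(line):
    """Return the seven fields as a dict; every field is '' when the line does not match."""
    match = re.search(log_pattern, line)
    if match is None:
        return {name: "" for name in FIELDS}
    return dict(zip(FIELDS, match.groups()))

# Hypothetical sample lines, invented for illustration (not taken from applogcat.log).
ok_line = "01-28 14:20:35.123  1795  1825 I HwLightsService: back light level changed"
bad_line = "01-28 14:20:35.123  1795  1825 I Some.Tag: tag contains a dot"

parsed_ok = parse_line(ok_line)
parsed_bad = parse_line(bad_line)
print(parsed_ok)
print(parsed_bad)
```

The second line fails because the source group only allows letters, digits and underscores. Tags with other characters may be one reason some lines end up with empty fields, though that is an assumption and has not been checked against the log file.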


In total, the table holds roughly 1.5 million records. The project is limited to this volume because of hardware constraints: when training on more data, PySpark crashed, probably due to insufficient RAM.

In [None]:
print(logs_rdd.count())

1555005

In [None]:
logs_df = logs_rdd.select(
    regexp_extract('value', log_pattern, 1).alias('date'),
    regexp_extract('value', log_pattern, 2).alias('time'),
    regexp_extract('value', log_pattern, 3).alias('pid'),
    regexp_extract('value', log_pattern, 4).alias('tid'),
    regexp_extract('value', log_pattern, 5).alias('level'),
    regexp_extract('value', log_pattern, 6).alias('source'),
    regexp_extract('value', log_pattern, 7).alias('message')
)

In [None]:
logs_df = logs_df.withColumn(
    "timestamp", to_timestamp(concat_ws(" ", col("date"), col("time")), "MM-dd HH:mm:ss.SSS")
).drop("date", "time")

DATA EXPLORATION

In [None]:
logs_df.show(truncate=False)

+----+----+-----+-----------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------+
|pid |tid |level|source                 |message                                                                                                                                                                                                                                                                                                                                                                                                              

The timestamp column contains many null values, and it does not appear to be relevant to the machine learning step, so it is dropped.

In [None]:
# Count nulls per column. regexp_extract returns an empty string (not null)
# for lines that do not match the pattern, so those rows are not counted here.
for column in ["timestamp", "pid", "tid", "level", "message"]:
    null_count = logs_df.filter(col(column).isNull()).count()
    print(f"{column} nulls:", null_count)



timestamp nulls: 419009
pid nulls: 0
tid nulls: 0
level nulls: 0
message nulls: 0




In [None]:
logs_df = logs_df.drop('timestamp')
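The 419009 null timestamps match the 419009 rows with an empty `level` in the table below, which suggests they come from lines the pattern did not match. The sketch below illustrates the likely mechanism in plain Python: unmatched fields are empty strings rather than nulls, and only the timestamp parse turns them into a missing value. `to_timestamp_or_none` is a hypothetical stand-in for Spark's `to_timestamp`, and the placeholder year is an assumption added because the log lines carry no year.

```python
from datetime import datetime

def to_timestamp_or_none(text):
    """Parse 'MM-dd HH:mm:ss.SSS' text; return None (a stand-in for null) on failure."""
    try:
        # Assumption: a placeholder year is prepended because the logs have none.
        return datetime.strptime("1970 " + text, "%Y %m-%d %H:%M:%S.%f")
    except ValueError:
        return None

# Fields as they would look for one matched and one unmatched line (made-up values).
matched = {"date": "01-28", "time": "14:20:35.123", "pid": "1795"}
unmatched = {"date": "", "time": "", "pid": ""}

rows = [matched, unmatched]
for row in rows:
    row["timestamp"] = to_timestamp_or_none(" ".join([row["date"], row["time"]]))

null_pids = sum(row["pid"] is None for row in rows)
empty_pids = sum(row["pid"] == "" for row in rows)
null_timestamps = sum(row["timestamp"] is None for row in rows)
print(null_pids, empty_pids, null_timestamps)
```

Under this reading, dropping the timestamp column removes the nulls but leaves the unmatched rows in the data, with empty `source` and `message` values.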

In [None]:
logs_df.groupBy("level").count().show()



+-----+------+
|level| count|
+-----+------+
|    E| 64769|
|    V| 38204|
|    D|444834|
|    W| 72910|
|    I|515276|
|     |419009|
|    F|     3|
+-----+------+



                                                                                

In [None]:
logs_df.groupBy("source").count().orderBy("count", ascending=False).show()


+--------------------+------+
|              source| count|
+--------------------+------+
|                    |419009|
| PowerManagerService| 75836|
|HwCustMobileSigna...| 55644|
|      wpa_supplicant| 53279|
| HwSignalClusterView| 38672|
|    UsbDeviceManager| 28579|
|SendBroadcastPerm...| 24563|
|     ActivityManager| 23781|
|     HwSystemManager| 20706|
|        NetWorkUtils| 19274|
|HwActivityManager...| 18200|
|libfingersense_wr...| 17832|
|HwMobileSignalCon...| 17820|
|            chromium| 16968|
|StackScrollAlgorithm| 14989|
|HwAmbientLuxFilte...| 14712|
|   NetworkManagement| 14271|
|PhoneInterfaceMan...| 14041|
|          HwLauncher| 13724|
|ActivityManager_b...| 13623|
+--------------------+------+
only showing top 20 rows



                                                                                

In [None]:
logs_df.groupBy("pid").count().show()

+-----+-----+
|  pid|count|
+-----+-----+
| 3858|  422|
| 4690| 1265|
|22049|   14|
|22148|   32|
|  836|  656|
|20818|   23|
| 3466| 1292|
|21569|   32|
|20894|  202|
| 3879|16186|
| 7194| 7994|
|  597|  783|
|21081|   23|
|  633|11408|
|16914|   43|
|  523|   19|
| 2835|   43|
|18142|  132|
|22921|   30|
| 1046|21441|
+-----+-----+
only showing top 20 rows




In [None]:
logs_df.groupBy("tid").count().show()



+-----+-----+
|  tid|count|
+-----+-----+
|20219|    1|
|21331|   53|
| 1572|   40|
| 3517| 4308|
| 1808| 5372|
|20626|    5|
| 7208| 4605|
|23054|    5|
| 3249| 5213|
| 3121| 5115|
|21556|    1|
|21833|   11|
| 2895|  380|
|22920|    7|
|20183|    5|
|20881|    4|
|22652|    3|
| 4690| 1247|
|22049|   14|
|20387|    4|
+-----+-----+
only showing top 20 rows



                                                                                

The words in the log messages are tokenized and then given a numeric representation as a vector. In addition, unnecessary stop words such as THE, A, AN and AS are removed.

In [None]:
tokenizer = Tokenizer(inputCol="message", outputCol="words")
tokenized_df = tokenizer.transform(logs_df)

In [None]:
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")
processed_df = remover.transform(tokenized_df)
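The two steps above can be approximated in plain Python. This is a simplified sketch, not Spark's implementation: the stop-word set contains only the four examples named in the text (Spark's default English list is much longer), and the sample message is invented.

```python
# Assumption: a tiny stand-in stop-word list built from the examples in the text.
STOP_WORDS = {"the", "a", "an", "as"}

def tokenize(message):
    """Lowercase the message and split it on whitespace."""
    return message.lower().split()

def remove_stop_words(words):
    """Keep only the words that are not in the stop-word list."""
    return [word for word in words if word not in STOP_WORDS]

message = "Animating brightness as the target changes"  # hypothetical message
words = tokenize(message)
filtered_words = remove_stop_words(words)
print(words)
print(filtered_words)
```

Note that a token such as `action:android.c...` in the output stays in one piece, because splitting happens only on whitespace.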

In [None]:
processed_df.show()

+----+----+-----+--------------------+--------------------+--------------------+--------------------+
| pid| tid|level|              source|             message|               words|      filtered_words|
+----+----+-----+--------------------+--------------------+--------------------+--------------------+
|1795|1825|    I|PowerManager_scre...|DisplayPowerState...|[displaypowerstat...|[displaypowerstat...|
|5224|5283|    I|SendBroadcastPerm...|action:android.co...|[action:android.c...|[action:android.c...|
|1795|1825|    D|DisplayPowerContr...|Animating brightn...|[animating, brigh...|[animating, brigh...|
|1795|1825|    I|PowerManager_scre...|DisplayPowerContr...|[displaypowercont...|[displaypowercont...|
|1795|2750|    I|PowerManager_scre...|DisplayPowerState...|[displaypowerstat...|[displaypowerstat...|
|1795|2750|    I|     HwLightsService|back light level ...|[back, light, lev...|[back, light, lev...|
|1795|1825|    D|DisplayPowerContr...|Animating brightn...|[animating, brigh...|[a

In [None]:
hashing_tf = HashingTF(inputCol="filtered_words", outputCol="raw_features", numFeatures=10000)
tf_df = hashing_tf.transform(processed_df)

idf = IDF(inputCol="raw_features", outputCol="features")
idf_model = idf.fit(tf_df)
final_df = idf_model.transform(tf_df)
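What `HashingTF` and `IDF` compute can be sketched in a few lines of plain Python. The assumptions are stated loudly: `zlib.crc32` is swapped in for Spark's own hash function, so the column indices differ from those in the notebook output, and the three documents are invented. The IDF weight follows the formula given in the Spark documentation, log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 10000  # same vector size as numFeatures in the notebook

def bucket(token):
    """Map a token to a column index; crc32 is a stand-in for Spark's hash function."""
    return zlib.crc32(token.encode("utf-8")) % NUM_FEATURES

def term_frequencies(tokens):
    """Sparse term-frequency vector stored as {index: count}."""
    return Counter(bucket(token) for token in tokens)

def inverse_document_frequencies(tf_vectors):
    """Compute idf = log((m + 1) / (df + 1)) for every index that occurs."""
    m = len(tf_vectors)
    df = Counter(index for vec in tf_vectors for index in vec)
    return {index: math.log((m + 1) / (count + 1)) for index, count in df.items()}

# Hypothetical tokenized messages.
docs = [["back", "light", "level"], ["back", "light", "changed"], ["back", "screen"]]
tf_vectors = [term_frequencies(doc) for doc in docs]
idf = inverse_document_frequencies(tf_vectors)
tfidf_vectors = [{i: tf * idf[i] for i, tf in vec.items()} for vec in tf_vectors]

print(idf[bucket("back")])
```

A token that occurs in every document, like "back" here, gets weight zero, so it contributes nothing to the final feature vector. With a fixed vector size, different tokens can also land on the same index.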



In [None]:
indexer = StringIndexer(inputCol="source", outputCol="label")
final_df = indexer.fit(final_df).transform(final_df)
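By default, `StringIndexer` orders values by descending frequency, so the most common source gets label 0.0. A rough plain-Python equivalent is sketched below; `fit_string_indexer` is a hypothetical helper and the miniature column is made up.

```python
from collections import Counter

def fit_string_indexer(values):
    """Assign 0.0 to the most frequent value, 1.0 to the next one, and so on."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(index) for index, value in enumerate(ordered)}

# Hypothetical miniature column; "" stands for lines the pattern did not match.
sources = ["", "", "", "PowerManagerService", "PowerManagerService", "wpa_supplicant"]
label_of = fit_string_indexer(sources)
print(label_of)
```

In the notebook the empty source is the most frequent value (419009 rows), which is consistent with the label 0.0 shown next to empty sources in the prediction output.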



In [None]:
final_df.show()

+----+----+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
| pid| tid|level|              source|             message|               words|      filtered_words|        raw_features|            features|label|
+----+----+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+
|1795|1825|    I|PowerManager_scre...|DisplayPowerState...|[displaypowerstat...|[displaypowerstat...|(10000,[1058,4665...|(10000,[1058,4665...| 24.0|
|5224|5283|    I|SendBroadcastPerm...|action:android.co...|[action:android.c...|[action:android.c...|(10000,[4799,9251...|(10000,[4799,9251...|  6.0|
|1795|1825|    D|DisplayPowerContr...|Animating brightn...|[animating, brigh...|[animating, brigh...|(10000,[2690,4142...|(10000,[2690,4142...| 59.0|
|1795|1825|    I|PowerManager_scre...|DisplayPowerContr...|[displaypowercont...|[displaypowercont...

Splitting the data into training and test sets in an 80:20 ratio.

In [None]:
train_df, test_df = final_df.randomSplit([0.8, 0.2], seed=42)


In [None]:
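# Note: the assembled "new_label" vector is not used by the model below
# (it reads featuresCol="features"), and it embeds the label itself.
assembler = VectorAssembler(inputCols=["raw_features", "features", "label"], outputCol="new_label")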


In [None]:
train_df = assembler.transform(train_df)


In [None]:
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)


The data is split into 5 buckets, and a logistic regression model is trained on each one. The final prediction is meant to be decided by a vote among the models.

In [None]:
num_buckets=5

In [None]:
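# Note: this table is built from final_df (training and test rows together)
# and is not read back anywhere later in the notebook.
train_df_with_id = final_df.withColumn("id", monotonically_increasing_id())
train_df_with_id.write.bucketBy(num_buckets, 'id').mode('overwrite').sortBy('id').saveAsTable('bucketed_table4')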



In [None]:
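models = []
for i in range(num_buckets):
    # Round-robin split: the row with index k goes to bucket k % num_buckets.
    partition_df = train_df.rdd.zipWithIndex().filter(lambda x: x[1] % num_buckets == i).map(lambda x: x[0]).toDF()
    # Train one logistic regression model per bucket.
    model = lr.fit(partition_df)
    models.append(model)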

25/01/28 14:40:29 WARN DAGScheduler: Broadcasting large task binary with size 1470.6 KiB
25/01/28 14:40:54 WARN DAGScheduler: Broadcasting large task binary with size 1472.9 KiB
25/01/28 14:40:57 WARN DAGScheduler: Broadcasting large task binary with size 1510.7 KiB
25/01/28 14:41:21 WARN DAGScheduler: Broadcasting large task binary with size 1511.8 KiB
25/01/28 14:41:21 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
25/01/28 14:41:21 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS
25/01/28 14:41:21 WARN DAGScheduler: Broadcasting large task binary with size 1511.3 KiB
25/01/28 14:41:48 WARN DAGScheduler: Broadcasting large task binary with size 1512.5 KiB
25/01/28 14:41:50 WARN DAGScheduler: Broadcasting large task binary with size 1511.4 KiB
25/01/28 14:41:55 WARN DAGScheduler: Broadcasting large task binary with size 1512.6 KiB
25/01/28 14:41:57 WARN DAGScheduler: Broadcasting large task binary wit
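The training loop assigns rows to buckets by taking the row index modulo the number of buckets. The same idea on a plain Python list, as a small sketch with invented rows and a hypothetical helper name:

```python
def round_robin_buckets(rows, num_buckets):
    """Put the row with index k into bucket k % num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for index, row in enumerate(rows):
        buckets[index % num_buckets].append(row)
    return buckets

rows = [f"log line {k}" for k in range(12)]  # hypothetical rows
buckets = round_robin_buckets(rows, num_buckets=5)
sizes = [len(bucket) for bucket in buckets]
print(sizes)
```

This keeps the bucket sizes within one row of each other, so each model sees about a fifth of the training data.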

Saving the trained models

In [None]:
base_path = "models/"
for i, model in enumerate(models):
    model_path = f"{base_path}model_bucket_{i}"

    model.save(model_path)

25/01/28 14:59:09 WARN TaskSetManager: Stage 200 contains a task of very large size (106064 KiB). The maximum recommended task size is 1000 KiB.
25/01/28 14:59:20 WARN TaskSetManager: Stage 204 contains a task of very large size (106224 KiB). The maximum recommended task size is 1000 KiB.
25/01/28 14:59:31 WARN TaskSetManager: Stage 208 contains a task of very large size (105985 KiB). The maximum recommended task size is 1000 KiB.
25/01/28 14:59:35 WARN TaskSetManager: Stage 212 contains a task of very large size (105745 KiB). The maximum recommended task size is 1000 KiB.
25/01/28 14:59:40 WARN TaskSetManager: Stage 216 contains a task of very large size (105506 KiB). The maximum recommended task size is 1000 KiB.


In [None]:
loaded_models = []
for i in range(num_buckets):
    model_path = f"{base_path}model_bucket_{i}"
    loaded_model = LogisticRegressionModel.load(model_path)
    loaded_models.append(loaded_model)  # collect the reloaded models
    print(f"Model {i} loaded from {model_path}")

Model 0 loaded from models/model_bucket_0
Model 1 loaded from models/model_bucket_1
Model 2 loaded from models/model_bucket_2
Model 3 loaded from models/model_bucket_3
Model 4 loaded from models/model_bucket_4


                                                                                

Testing the models on the test set

In [None]:
predictions = []
for model in models:
    pred = model.transform(test_df)
    predictions.append(pred)

In [None]:
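# Stack the per-model predictions into one DataFrame. Note: union only
# concatenates rows (5 x the test set); it does not perform a vote.
final_predictions = predictions[0]
for pred in predictions[1:]:
    final_predictions = final_predictions.union(pred)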
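The union above stacks the five models' predictions rather than combining them. A majority vote, as the text describes, could look like the following minimal sketch in plain Python. The prediction values are invented, `majority_vote` is a hypothetical helper, and the tie-breaking rule (smallest label wins) is an arbitrary assumption.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common prediction; ties go to the smallest label."""
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

# Hypothetical predictions: one inner list per model, one entry per test row.
per_model = [
    [0.0, 24.0, 6.0],
    [0.0, 24.0, 59.0],
    [0.0, 6.0, 59.0],
    [0.0, 24.0, 6.0],
    [0.0, 6.0, 6.0],
]
# zip(*per_model) groups the five votes that belong to the same test row.
voted = [majority_vote(row_votes) for row_votes in zip(*per_model)]
print(voted)
```

On a DataFrame, the same approach would need a row identifier in the test set so that the five predictions for one row can be grouped together.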


In [None]:
final_predictions.show(5)

25/01/28 15:05:45 WARN DAGScheduler: Broadcasting large task binary with size 507.2 MiB


+---+---+-----+------+-------+-----+--------------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|pid|tid|level|source|message|words|filtered_words|        raw_features|            features|label|       rawPrediction|         probability|prediction|
+---+---+-----+------+-------+-----+--------------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|   |   |     |      |       |   []|            []|(10000,[3372],[1.0])|(10000,[3372],[1....|  0.0|[14.6935155215089...|[0.99298496037442...|       0.0|
|   |   |     |      |       |   []|            []|(10000,[3372],[1.0])|(10000,[3372],[1....|  0.0|[14.6935155215089...|[0.99298496037442...|       0.0|
|   |   |     |      |       |   []|            []|(10000,[3372],[1.0])|(10000,[3372],[1....|  0.0|[14.6935155215089...|[0.99298496037442...|       0.0|
|   |   |     |      |       |   []|            []|(10000,[3372],[1.0])|(10000,[33

                                                                                

In [None]:
unique_predictions_count = final_predictions.select("prediction").distinct().count()

print(f"Number of unique predictions: {unique_predictions_count}")

25/01/28 15:02:42 WARN DAGScheduler: Broadcasting large task binary with size 507.2 MiB
25/01/28 15:03:32 WARN DAGScheduler: Broadcasting large task binary with size 507.0 MiB


Number of unique predictions: 1005


                                                                                

In [None]:
record_count = test_df.count()
record_count



310625

As shown, the models produced 1005 distinct labels describing 310625 log entries. For now the labels are numeric and not very informative, but the project could be extended with automated labeling.
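One possible first step toward readable labels is to map each numeric prediction back to the source name it was indexed from. In PySpark this is what `IndexToString` (imported at the top of the notebook) does when given the `labels` list of the fitted `StringIndexer` model. The plain-Python sketch below shows only the lookup itself; the label list and predictions are hypothetical.

```python
# Hypothetical label list, ordered by index as a fitted indexer would store it.
labels = ["", "PowerManagerService", "wpa_supplicant"]

def decode(prediction):
    """Turn a numeric prediction such as 1.0 back into its source name."""
    return labels[int(prediction)]

predictions = [1.0, 0.0, 2.0]
decoded = [decode(p) for p in predictions]
print(decoded)
```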