V2.7.2 release #85

Merged 26 commits on Aug 31, 2023
d9fb100
better error handling
mukunku Jun 5, 2023
9993f4e
better list exception handling based on error logs
mukunku Jun 7, 2023
3c557c3
don't log invalid query exceptions
mukunku Jun 7, 2023
87532f5
bump assembly version
mukunku Jun 7, 2023
3db1357
better io exception handling
mukunku Jun 11, 2023
51704a5
exception cleanup
mukunku Jun 12, 2023
7ea551c
fix list type check
mukunku Aug 12, 2023
0389fe0
add column metadata to rowgroup metadata
mukunku Aug 12, 2023
5aa945a
loosen list schema validation
mukunku Aug 12, 2023
e1f60c8
update parquet.net library
mukunku Aug 12, 2023
0804119
remove statistics and encoding stats for now unless someone needs them
mukunku Aug 12, 2023
c379c99
intercept byte[] fields and render them as strings
mukunku Aug 17, 2023
d2ff2d5
bump assembly version to 2.7.2.1
mukunku Aug 17, 2023
40be835
some minor cleanup
mukunku Aug 17, 2023
330bc83
change default columns size mode to all cells
mukunku Aug 17, 2023
1faa99d
add copy raw button for thrift metadata and remove rowgroup details a…
mukunku Aug 17, 2023
5c4d7e2
update parquet.net package
mukunku Aug 17, 2023
29cb2e7
some cleanup
mukunku Aug 17, 2023
b3b2806
Update README.md
mukunku Aug 18, 2023
080226b
Update README.md
mukunku Aug 18, 2023
436992b
add fix for malformed datetime
mukunku Aug 30, 2023
51cc9fe
update packages and assembly version
mukunku Aug 30, 2023
2dd4a51
start tracking selfcontained executable usage
mukunku Aug 30, 2023
ca4b5ab
fix unit test
mukunku Aug 30, 2023
abfad6a
fix the test for realz this time
mukunku Aug 30, 2023
854bb0d
Add "SC" suffix to version number in about box for self contained dep…
mukunku Aug 30, 2023
@@ -2,7 +2,7 @@
{
public class UnsupportedFieldException : Exception
{
- public UnsupportedFieldException(string fieldName, Exception? ex = null) : base(fieldName, ex)
+ public UnsupportedFieldException(string message, Exception? ex = null) : base(message, ex)
{

}
63 changes: 58 additions & 5 deletions src/ParquetViewer.Engine/ParquetEngine.Processor.cs
@@ -1,4 +1,5 @@
using Parquet;
using Parquet.Meta;
using ParquetViewer.Engine.Exceptions;
using System.Collections;
using System.Data;
@@ -104,7 +105,7 @@ public async Task<DataTable> ReadRowsAsync(List<string> selectedFields, int offs
}
}

- private static async Task ReadPrimitiveField(DataTable dataTable, ParquetRowGroupReader groupReader, int rowBeginIndex, ParquetSchemaElement field,
+ private async Task ReadPrimitiveField(DataTable dataTable, ParquetRowGroupReader groupReader, int rowBeginIndex, ParquetSchemaElement field,
long skipRecords, long readRecords, bool isFirstColumn, Dictionary<int, DataRow> rowLookupCache, CancellationToken cancellationToken, IProgress<int>? progress)
{
int rowIndex = rowBeginIndex;
@@ -146,13 +147,58 @@ public async Task<DataTable> ReadRowsAsync(List<string> selectedFields, int offs
}
}

- datarow[fieldIndex] = value ?? DBNull.Value;
+ datarow[fieldIndex] = FixDateTime(value, field) ?? DBNull.Value;

rowIndex++;
progress?.Report(1);
}
}

/// <summary>
Contributor

Strange that you need to work with such a patch...
Was it already here before?

Owner Author

@mukunku mukunku Aug 30, 2023

I shared some details on your issue ticket. I believe the timestamp field is malformed, which is why it's being shown as an epoch value instead of a datetime.

I added a patch so we can still open such fields in the app for now. These types of inconsistencies tend to get resolved over time, so I'm hoping that will be the case with this issue. I added a unit test to detect this as well.

We used to handle timestamp fields directly in the app, but the parquet-dotnet library added support for handling DateTime fields internally, so we removed that logic from the app. But if the metadata is malformed, the library of course doesn't handle the field as a DateTime, so I added the old logic back for now.

Contributor

Good point. I am wondering which package doesn't write the metadata correctly.

The file was made with pandas, but that is just a wrapper around pyarrow, the Python implementation of Apache Arrow, a package from the creators of Apache Parquet.

So strange... or it's on the parquet-dotnet side.

I'll see if I need to open another bug somewhere else.

Thanks

/// This is a patch fix to handle malformed datetime fields. We assume TIMESTAMP fields are DateTime values.
/// </summary>
/// <param name="value">Original value</param>
/// <param name="field">Schema element</param>
/// <returns>If the field is a timestamp, a DateTime object will be returned. Otherwise the value will not be changed.</returns>
private object? FixDateTime(object value, ParquetSchemaElement field)
{
if (!this.FixMalformedDateTime || value is null)
return value;

var timestampSchema = field.SchemaElement?.LogicalType?.TIMESTAMP;
if (timestampSchema is not null && field.SchemaElement?.ConvertedType is null)
{
long castValue;
if (field.DataField?.ClrType == typeof(long?))
{
castValue = ((long?)value).Value; //We know this isn't null from the null check above
}
else if (field.DataField?.ClrType == typeof(long))
{
castValue = (long)value;
}
else
{
throw new UnsupportedFieldException($"Field {field.Path} is not a valid timestamp field");
}

int divideBy = 0;
if (timestampSchema.Unit.NANOS != null)
divideBy = 1000 * 1000;
else if (timestampSchema.Unit.MICROS != null)
divideBy = 1000;
else if (timestampSchema.Unit.MILLIS != null)
divideBy = 1;

if (divideBy > 0)
value = DateTimeOffset.FromUnixTimeMilliseconds(castValue / divideBy).DateTime;
else //Not sure if this 'else' is correct but adding just in case
value = DateTimeOffset.FromUnixTimeSeconds(castValue);
}

return value;
}

private static async Task ReadListField(DataTable dataTable, ParquetRowGroupReader groupReader, int rowBeginIndex, ParquetSchemaElement field,
long skipRecords, long readRecords, bool isFirstColumn, Dictionary<int, DataRow> rowLookupCache, CancellationToken cancellationToken, IProgress<int>? progress)
{
@@ -162,7 +208,7 @@ public async Task<DataTable> ReadRowsAsync(List<string> selectedFields, int offs
{
itemField = listField.GetChildOrSingle("item"); //Not all parquet files follow the same format so we're being lax with getting the child here
}
- catch(Exception ex)
+ catch (Exception ex)
{
throw new UnsupportedFieldException($"Cannot load field '{field.Path}. Invalid List type.'", ex);
}
@@ -312,14 +358,21 @@ private DataTable BuildDataTable(List<string> fields)
var schema = ParquetSchemaTree.GetChild(field);

DataColumn newColumn;
- if (schema.SchemaElement.ConvertedType == Parquet.Meta.ConvertedType.LIST)
+ if (schema.SchemaElement.ConvertedType == ConvertedType.LIST)
{
newColumn = new DataColumn(field, typeof(ListValue));
}
- else if (schema.SchemaElement.ConvertedType == Parquet.Meta.ConvertedType.MAP)
+ else if (schema.SchemaElement.ConvertedType == ConvertedType.MAP)
{
newColumn = new DataColumn(field, typeof(MapValue));
}
else if (this.FixMalformedDateTime
&& schema.SchemaElement.LogicalType?.TIMESTAMP is not null
&& schema.SchemaElement?.ConvertedType is null)
{
//Fix for malformed datetime fields (#88)
newColumn = new DataColumn(field, typeof(DateTime));
}
else
{
var clrType = schema.DataField?.ClrType ?? throw new Exception($"{field} has no data field");
2 changes: 2 additions & 0 deletions src/ParquetViewer.Engine/ParquetEngine.cs
@@ -29,6 +29,8 @@ public partial class ParquetEngine : IDisposable

public string OpenFileOrFolderPath { get; }

public bool FixMalformedDateTime { get; set; } = true;

private ParquetSchemaElement BuildParquetSchemaTree()
{
var thriftSchema = ThriftMetadata.Schema ?? throw new Exception("No thrift metadata was found");
Binary file not shown.
3 changes: 3 additions & 0 deletions src/ParquetViewer.Tests/ParquetViewer.Tests.csproj
@@ -59,6 +59,9 @@
<None Update="Data\LIST_TYPE_TEST1.parquet">
<CopyToOutputDirectory>Always</CopyToOutputDirectory>
</None>
<None Update="Data\MALFORMED_DATETIME_TEST1.parquet">
<CopyToOutputDirectory>Always</CopyToOutputDirectory>
</None>
<None Update="Data\MAP_TYPE_TEST1.parquet">
<CopyToOutputDirectory>Always</CopyToOutputDirectory>
</None>
19 changes: 19 additions & 0 deletions src/ParquetViewer.Tests/SanityTests.cs
@@ -243,5 +243,24 @@ public async Task NULLABLE_GUID_TEST1()
Assert.Equal(new Guid("0cf9cbfd-d320-45d7-b29f-9c2de1baa979"), dataTable.Rows[0][1]);
Assert.Equal(new DateTime(2019, 1, 1), dataTable.Rows[0][4]);
}

[Fact]
public async Task MALFORMED_DATETIME_TEST1()
{
using var parquetEngine = await ParquetEngine.OpenFileOrFolderAsync("Data/MALFORMED_DATETIME_TEST1.parquet", default);

var dataTable = await parquetEngine.ReadRowsAsync(parquetEngine.Fields, 0, int.MaxValue, default);
Assert.Equal(typeof(DateTime), dataTable.Rows[0]["ds"]?.GetType());

//Check if the malformed datetime still needs to be fixed
parquetEngine.FixMalformedDateTime = false;

dataTable = await parquetEngine.ReadRowsAsync(parquetEngine.Fields, 0, int.MaxValue, default);
if (dataTable.Rows[0]["ds"]?.GetType() == typeof(DateTime))
{
Assert.Fail("Looks like the Malformed DateTime Fix is no longer needed! Remove that part of the code.");
}
Assert.Equal(typeof(long), dataTable.Rows[0]["ds"]?.GetType()); //If it's not a datetime, then it should be a long.
}
}
}
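
The unit handling inside `FixDateTime` can be looked at in isolation. The following is a minimal sketch, not the project's code: `TimestampUnit` and `EpochSketch` are hypothetical names that stand in for the Parquet TIMESTAMP unit and the engine method. It assumes, as the patch does, that the raw value is an offset from the Unix epoch, truncates anything finer than a millisecond, and treats a missing unit as seconds. Unlike the seconds fallback in the diff, which assigns a `DateTimeOffset`, the sketch returns a `DateTime` from every branch so the result matches a `DateTime` column.

```csharp
using System;

// Hypothetical stand-in for the unit carried by a Parquet TIMESTAMP logical type.
public enum TimestampUnit { Millis, Micros, Nanos }

public static class EpochSketch
{
    // Converts a raw epoch value to a DateTime (UTC clock time, Kind = Unspecified).
    // Sub-millisecond digits are truncated. A null unit is assumed to mean seconds.
    public static DateTime ToDateTime(long raw, TimestampUnit? unit)
    {
        switch (unit)
        {
            case TimestampUnit.Nanos:
                // 1,000,000 nanoseconds per millisecond
                return DateTimeOffset.FromUnixTimeMilliseconds(raw / 1_000_000).DateTime;
            case TimestampUnit.Micros:
                // 1,000 microseconds per millisecond
                return DateTimeOffset.FromUnixTimeMilliseconds(raw / 1_000).DateTime;
            case TimestampUnit.Millis:
                return DateTimeOffset.FromUnixTimeMilliseconds(raw).DateTime;
            default:
                // No unit available: fall back to whole seconds.
                return DateTimeOffset.FromUnixTimeSeconds(raw).DateTime;
        }
    }
}
```

For example, `EpochSketch.ToDateTime(1693353600000L, TimestampUnit.Millis)` yields 2023-08-30 00:00:00, and the same instant expressed in microseconds or nanoseconds converts to the same value.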